
Thursday, November 18, 2010

Get your head around BigData at the Internet Summit

#BigData session by IBM and Google http://bit.ly/9jsoia
Big Data session at Internet Summit 2010
November 18 2:30pm

Rod Smith, VP Emerging Technology and IBM Fellow

Data production is increasing at an astonishing rate
Twitter - 7 terabytes of data each day
10 terabytes of data each day
CERN's Large Hadron Collider produces 40 terabytes per second, but researchers are looking for only a few GB of data insights within it
Web-based analytics puts power into consumer hands.

IBM Emerging tech team goes to customers to find new solutions for big challenges.
People usually start by saying "Google knows more about my data than I do, so I need an internal Google." Building a search engine is not the place to start. Look at your data.

Big data – new class of application on the horizon
  1. gather data
  2. extract items of interest
  3. explore insights
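The three-step pattern above can be sketched in a few lines of Python; the records, field names, and keyword filter here are invented for illustration, not from the talk:

```python
# A minimal sketch of the gather / extract / explore pattern,
# using made-up records standing in for a feed or crawl.

records = [
    {"text": "thinking about buying an iPhone"},
    {"text": "rain again today"},
    {"text": "my iPhone screen cracked"},
]

# 1. gather data (here: an in-memory list)
gathered = records

# 2. extract items of interest (a simple keyword filter)
extracted = [r for r in gathered if "iphone" in r["text"].lower()]

# 3. explore insights (counting is the simplest possible analysis)
print(f"{len(extracted)} of {len(gathered)} records mention iPhones")
```

The point is the shape of the pipeline, not the filter: real systems swap in crawlers for step 1 and analytics engines for step 3.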

Rod Smith talks about BigSheets, Coremetrics and customers like NCSU at #isum10 #bigdata

An insight engine for enabling ad-hoc business insights.
Structure data a little bit – but not too much.

After Google published its MapReduce paper, a new Apache project called Hadoop was started.
  • REST interfaces for plugging in Analytics pieces
  • Visualization components
  • REST API for customer choice of analytic engines
Customer Examples
Number 1 request: Twitter
This means Buyer Sentiment Analysis
What can I learn from social networks? (video)

Gathered: 350K Tweets
Extracted: How many are interested in iPhones? 50K show sentiment.
Analyzed: How many are interested in buying? 2300 people.
Think of this as a new Social CRM methodology
Time to analyze: 2 days
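A toy version of the extract/analyze steps in the Twitter funnel above: a tiny keyword lexicon stands in for real sentiment analysis, and the tweet texts are invented for illustration.

```python
# Hypothetical lexicons; real sentiment analysis is far more involved.
POSITIVE = {"love", "want", "great"}
BUYING = {"buy", "buying", "ordered"}

tweets = [
    "I love my iPhone",
    "thinking of buying an iPhone 4",
    "iphone battery is fine I guess",
]

def words(t):
    return set(t.lower().split())

# Extract: tweets showing sentiment at all
with_sentiment = [t for t in tweets if words(t) & POSITIVE]
# Analyze: the narrower set that mentions buying
buyers = [t for t in tweets if words(t) & BUYING]

print(len(with_sentiment), "show sentiment;", len(buyers), "mention buying")
```

Scaled up, this is the same funnel as Rod's numbers: 350K gathered, 50K with sentiment, 2,300 likely buyers.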

2nd Example: NC State
How to identify companies that would be interested in using NCSU OTT technology?
Reduced a long manual process to 2 weeks

Marketplace Application Example: Seton Hall University
Used IBM Coremetrics to analyze the relationship between the increase in mobile data use and the iPad release, visualized with heat maps.

... All others bring data. How do we handle #bigdata? http://bit.ly/9jsoia #isum10

Download Rod's PDF presentation from jStart.

Part 2 is...
Joe Gregorio, Google developer advocate (also REST and APP / Atom Pub Protocol guy)

So you have a lot of data, now what?

Top Five rules for Big Data
1. Save everything
2. Use the simplest model first
3. Get better by adding more data
4. Do the math
5. Be lazy

(Joe Gregorio's rapid speaking clip plus Lessig style slides make it difficult to live-blog his #bigdata talk. But it's good stuff.)

n-grams are important
there are 1B five-word sequences (5-grams) out there
point: there are big data sets out there
there are 4 petabytes of data on Flickr (~1 MB/photo × 4B photos)
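For concreteness, here is what extracting 5-grams (the five-word sequences Joe's 1B figure refers to) looks like; the sample sentence is made up:

```python
# Slide a window of n tokens across the text, emitting each n-gram.
def ngrams(tokens, n=5):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox jumps over the lazy dog".split()
for g in ngrams(tokens):
    print(" ".join(g))
```

Nine tokens yield five 5-grams; run the same window over a web-scale crawl and you get the billion-sequence data sets he's describing.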

Save everything: biggest data sets are an asset
simplest data model is ignorant about language and culture
it works because the data set is sufficiently large

Use the simplest model first
If you remember one thing...
Figure. Learning Curves for Confusion Set Disambiguation

Joe Gregorio from Google says if you remember 1 thing today.... #bigdata #isum10

Some tech
Storing: GFS, HadoopFS
Processing: MapReduce, Hadoop
Pregel (graph processing), Dremel (SQL-like)
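The MapReduce model behind Hadoop can be shown with the classic word count, simulated here in-process; on a real cluster the three phases below run distributed across machines:

```python
from collections import defaultdict

docs = ["big data big insights", "data at scale"]

# Map phase: emit (word, 1) pairs from each document.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle phase: group emitted values by key.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: sum the counts for each word.
counts = {word: sum(values) for word, values in grouped.items()}
print(counts)  # {'big': 2, 'data': 2, 'insights': 1, 'at': 1, 'scale': 1}
```

The framework's job is the middle (shuffle) step at scale; the programmer supplies only the map and reduce functions.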

Do the math: know how long something is going to take before you kick it off
When hard drives get really full, they start to behave like tape drives
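"Do the math" can literally be a back-of-envelope script. The 100 MB/s throughput figure below is an assumption (a plausible 2010-era sustained disk read speed), not a number from the talk:

```python
# Estimate how long a sequential scan of a data set takes on one disk.
data_tb = 7                # e.g. one day of Twitter data, per Rod's talk
throughput_mb_s = 100      # assumed sustained read speed (illustrative)

seconds = data_tb * 1_000_000 / throughput_mb_s
print(f"~{seconds / 3600:.1f} hours on a single disk")
```

Roughly 19 hours for one disk; knowing that before you kick off a job is exactly the rule's point, and it's also why you parallelize.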

Be Lazy: Don't build infrastructure if you don't have to
Don't do analysis you don't have to: Google Trends for example
new: able to download CSV data from Google Trends

Note: both speakers say they'll post their slides online; I'll link or embed them here when I find 'em. In the meantime, here are some references to explore:
