
Big Data session at Internet Summit 2010
November 18 2:30pm
Rod Smith, VP Emerging Technology and IBM Fellow
Data production is increasing at an astonishing rate
Twitter - 7 terabytes of data each day
10 terabytes of data each day
CERN's Large Hadron Collider produces 40 terabytes per second, but researchers are looking for only a few GB of data insights
Web-based analytics puts power into consumer hands.
IBM Emerging tech team goes to customers to find new solutions for big challenges.
People usually start by saying "Google knows more about my data than I do, so I need an internal Google." Building a search engine is not the place to start. Look at your data.
Big data – new class of application on the horizon
- gather data
- extract items of interest
- explore insights
BigSheets
An insight engine for enabling ad-hoc insights for business purposes.
Structure data a little bit – but not too much.
When Google announced a project called MapReduce, a new Apache project called Hadoop was started.
- REST interfaces for plugging in Analytics pieces
- Visualization components
- REST API for customer choice of analytic engines
Customer Examples
Number 1 request: Twitter
This means Buyer Sentiment Analysis
What can I learn from social networks? (video)
Gathered: 350K Tweets
Extracted: How many are interested in iPhones? 50K show sentiment.
Analyzed: How many are interested in buying? 2300 people.
Think of this as a new Social CRM methodology
Time to analyze: 2 days
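The gather / extract / analyze numbers above form a funnel, and it helps to see how sharply each step narrows the data. A minimal sketch of that arithmetic, using only the counts reported in the talk (the stage labels and the `share` helper are mine, and the last stage counts people rather than tweets, so the ratios are rough):

```python
# Counts as reported in the Twitter example from the talk.
gathered = 350_000       # tweets gathered
with_sentiment = 50_000  # tweets showing sentiment about iPhones
buying_intent = 2_300    # people interested in buying


def share(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


# Stage labels are illustrative, not terminology from the talk.
funnel = {
    "extracted / gathered": share(with_sentiment, gathered),
    "analyzed / extracted": share(buying_intent, with_sentiment),
    "analyzed / gathered": share(buying_intent, gathered),
}

for stage, pct in funnel.items():
    print(f"{stage}: {pct:.2f}%")
```

Well under 1% of the gathered tweets end up as buying signals, which is the "looking for only a few GB of insights" point in miniature.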
2nd Example: NC State
How to identify companies that would be interested in using NCSU OTT technology?
Reduced long manual process to 2 weeks
Marketplace Application Example: Seton Hall University
Use IBM Coremetrics to analyze the relationship between the increase in mobile data use and the iPad release, visualized with heat maps.
Download Rod's PDF presentation from jStart.
Part 2 is...
Joe Gregorio, Google developer advocate (also the REST and Atom Publishing Protocol / AtomPub guy)
So you have a lot of data, now what?
Top Five rules for Big Data
1. Save everything
2. Use simplest model first
3. Get better by adding more data
4. Do the math
5. Be lazy
(Joe Gregorio's rapid speaking clip plus Lessig style slides make it difficult to live-blog his #bigdata talk. But it's good stuff.)
n-grams are important
there are 1B 5-word sequences out there
point: there are big data sets out there
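For anyone who hasn't met the term: the 5-word sequences mentioned above are n-grams with n = 5. A minimal sketch of extracting and counting them in Python; the sample sentence and the `ngrams` helper are my own illustration, not something shown in the talk:

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sample text; a real corpus would be web-scale.
text = "save everything and use the simplest model first then save everything again"
tokens = text.split()

five_grams = ngrams(tokens, 5)             # the 5-word sequences
bigram_counts = Counter(ngrams(tokens, 2))  # counts of 2-word sequences

print(len(five_grams), "five-grams")
print(bigram_counts.most_common(1))
```

Note there is nothing language-specific here beyond splitting on whitespace: the counts only become useful when the corpus is very large.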
there are 4 petabytes of data on Flickr (~1MB / photo * 4B photos)
Save everything: biggest data sets are an asset
simplest data model is ignorant about language and culture
it works because the data set is sufficiently large
Use the simplest model first
If you remember one thing...
Figure. Learning Curves for Confusion Set Disambiguation
Some tech
Storing: GFS, HDFS (Hadoop Distributed File System)
Processing: MapReduce, Hadoop
Pregel (graph processing), Dremel (SQL-like)
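To make the MapReduce idea concrete, here is a single-process sketch of the map, shuffle, and reduce steps for a word count. This is plain Python on one machine, not the Hadoop or Google API; the function names are my own, and a real framework would spread each phase across many machines:

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input documents.
counts = word_count(["big data big insights", "save big data"])
print(counts)
```

The map and reduce functions never share state, which is what lets a framework run them in parallel over a distributed file system.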
Do the math: know how long something is going to take before you kick it off
Hard drives, when they get really full, start to behave like tape drives
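A back-of-the-envelope version of "do the math", using the Flickr figure from earlier in the talk (~1MB per photo, 4B photos). The 100 MB/s sequential read rate and the 1,000-disk cluster are my assumptions for illustration, not numbers from the talk:

```python
MB = 10**6
PB = 10**15

# From the talk: roughly 1 MB per photo, 4 billion photos.
photos = 4 * 10**9
total_bytes = photos * 1 * MB

# ASSUMPTION: 100 MB/s sustained sequential read per disk.
ASSUMED_DISK_THROUGHPUT = 100 * MB


def scan_seconds(total_bytes, bytes_per_second_per_disk, disks=1):
    """Time to read everything once, assuming perfectly parallel disks."""
    return total_bytes / (bytes_per_second_per_disk * disks)


one_disk_days = scan_seconds(total_bytes, ASSUMED_DISK_THROUGHPUT) / 86_400
# ASSUMPTION: 1,000 disks reading in parallel with no overhead.
thousand_disk_hours = scan_seconds(total_bytes, ASSUMED_DISK_THROUGHPUT, disks=1000) / 3_600

print(f"total: {total_bytes / PB:.0f} PB")
print(f"one disk: {one_disk_days:.0f} days")
print(f"1,000 disks: {thousand_disk_hours:.1f} hours")
```

Under these assumptions a single pass takes over a year on one disk and about half a day on a thousand, which is worth knowing before you kick the job off.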
Be Lazy: Don't build infrastructure if you don't have to
Don't do analysis you don't have to: Google Trends for example
new: able to download CSV data from Google Trends
Note: both speakers say they'll post their slides online; I'll link or embed them here when I find 'em. In the meantime, here are some references to explore:
- Update: download Rod's presentation in PDF format
- Watch the YouTube video of the Twitter sentiment analysis example
- Follow the jStart team blog
- DeveloperWorks Article: How to solve cloud-related Big Data problems with MapReduce
- BigSheets project from IBM jStart team
- Try the IBM distribution of Apache Hadoop
- Check out open source project Apache Hadoop which implements MapReduce among other things
- A MapReduce paper from Google Research
- Follow All Things Hadoop on twitter



