Wednesday, June 29, 2011

Big data and the sampling approach

Big data: The next frontier for innovation, competition, and productivity

After my two-hour commute from home, I stopped at Giant in Urbana, MD and bought flowers and chocolate for my wonderful wife.  During this process, I scanned my Giant card and got my rebate of a couple of dollars.  While I was walking back to my car, I remembered one of the workshops I attended at last year's Gartner Symposium ITXPO.  The workshop was about visual analytics.  During the workshop I met an enterprise architect from a loyalty card company.  The company produces business intelligence by ingesting grocery store chains' loyalty card data.  I then thought about the issue of big data.  This is one of the hottest topics right now in the realm of information technology.

What is big data?  According to Wikipedia, "Big data are datasets that grow so large that they become awkward to work with using on-hand database management tools."  It is wonderful that we have technologies that can store large data sets and run complex algorithms on fast computers to produce views into the data.  This approach, however, differs from the traditional statistical approach of sampling.  As a trained chemist, I was raised on the sampling approach: many of the experiments I did in physical and analytical chemistry labs relied heavily on it.

This leads to the next question.  Which way is better?  This depends on several factors.  For example:
  • Data capturing methodology - If it takes a few minutes to assess and capture each data point, or if all of the data points cannot be captured during the process, then sampling may be the way to go.  For example, the Census Bureau could not do its job if it took the big data approach to create its census.  It would take a long time, and there is no guarantee that it would interview every single individual.  The data points (people) are also changing as individuals are born and die.
  • Scalability - The big data approach is not a scalable approach.  Yes, the big data store vendors like Oracle, SAP, Microsoft and Google will try to sell you more computing, but capturing all of the data points and running algorithms over them can be cost- and resource-prohibitive.  I believe sampling is the way to go if you want to scale the model (see the sketch after this list).
  • Multiple views into a finite data set - If the data set is finite and small enough to process in full, then I believe going the big data way is the right way, since the computers will provide some unique views into the data.
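To make the scalability trade-off concrete, here is a minimal Python sketch comparing the two approaches on a simple statistic.  Everything here is hypothetical: the transaction values, the data set size, and the sample size are made up for illustration, not drawn from any real loyalty card data.

```python
import random

# Hypothetical grocery transactions: one million basket totals drawn
# from a made-up distribution (mean $55, std dev $20). Illustrative only.
random.seed(42)
all_transactions = [random.gauss(55.0, 20.0) for _ in range(1_000_000)]

# "Big data" approach: scan every single record.
full_mean = sum(all_transactions) / len(all_transactions)

# Sampling approach: estimate the same statistic from a small random sample.
sample = random.sample(all_transactions, 1_000)
sample_mean = sum(sample) / len(sample)

print(f"Full scan mean:  {full_mean:.2f}  ({len(all_transactions):,} records)")
print(f"Sample estimate: {sample_mean:.2f}  ({len(sample):,} records)")
```

On a well-mixed data set like this, the sample estimate typically lands within a percent or two of the full-scan answer while touching only a thousandth of the records.  That, in a nutshell, is why sampling scales where the full-scan approach becomes cost-prohibitive.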
If anyone out there is working with both approaches, please comment on this entry.
