What is big data? According to Wikipedia, "Big data are datasets that grow so large that they become awkward to work with using on-hand database management tools." It is wonderful that we have technologies that can store large data sets and run complex algorithms on fast computers to produce views into those data sets. This approach, however, differs from traditional statistical approaches such as sampling. As a trained chemist, I was raised on the sampling approach. Many of the experiments I did in physical and analytical chemistry labs relied heavily on it.
This leads to the next question. Which way is better? This depends on several factors. For example:
- Data capturing methodology - If it takes a few minutes to assess and capture data, then sampling may be the way to go. The same is true if not all of the data points can be captured during the process. For example, the Census Bureau could not do its job if it took the big data approach to its census. It would take a long time, and there is no guarantee that they would interview every single individual in the process. The data points (people) also keep changing as individuals die and are born.
- Scalability - The Big Data approach is not a scalable approach. Yeah, the big data store vendors like Oracle, SAP, Microsoft and Google will try to sell you more computing power, but capturing all of the data points and running algorithms on them can be cost- and resource-prohibitive. I believe sampling is the way to go if you want to scale the model.
- Multiple views into a finite data set - If the data set is finite, then I believe going the Big Data way is the right way, since the computers can provide some unique views into the data.
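To make the trade-off between the two approaches a bit more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the "population" is synthetic (random numbers, not real measurements), and the helper name `estimate_mean`, the population size, and the sample size are all hypothetical choices. It compares the mean computed over every data point with an estimate taken from a small random sample.

```python
import random
import statistics


def estimate_mean(population, sample_size, seed=0):
    """Estimate the population mean from a simple random sample.

    Returns the sample mean and its standard error.
    (Hypothetical helper, written for this illustration only.)
    """
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)  # sampling without replacement
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / sample_size ** 0.5
    return mean, std_err


# Synthetic "full" data set: an assumption, standing in for real measurements.
rng = random.Random(42)
population = [rng.gauss(50.0, 10.0) for _ in range(200_000)]

# "Big data" style: touch every data point.
full_mean = statistics.fmean(population)

# Sampling style: look at only 1,000 of the 200,000 points.
sample_mean, std_err = estimate_mean(population, 1_000)

print(f"mean over all points: {full_mean:.2f}")
print(f"mean from sample:     {sample_mean:.2f} (standard error {std_err:.2f})")
```

In this sketch the sample reads half a percent of the data and still lands close to the full answer, with the standard error telling you roughly how far off it might be. What the sample cannot give you is the ability to slice the full data set in many different ways afterwards, which is where the full-capture approach earns its keep.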