Monday, February 11, 2013

Ruminating on Big Data

Came across an interesting infodeck on Big Data by Martin Fowler. There is a lot of hype around Big Data and there are tens of pundits defining Big Data in their own terms :) IMHO, right now we are at the "peak of inflated expectations" and "height of media infatuation" in the hype cycle.

But I agree with Martin on the fact that there is considerable fire behind the smoke. Once the hype dies down, folks would realize that we don't need another fancy term, but actually need to rethink about the basic principles of data-management.

There are 3 fundamental changes that would drive us to look beyond our current understanding around Data Management.
  1. Volume of Data: Today the volume of data is so huge, that traditional data management techniques of creating a centralized database system is no longer feasible. Grid based distributed databases are going to become more and more common.
  2. Speed at which Data is growing: Due to Web 2.0, explosion in electronic commerce, Social Media, etc. the rate at which data (mostly user generated content) is growing is unprecedented in the history of mankind.  According to Eric Schmidt (Google CEO), every two days now we create as much information as we did from the dawn of civilization up until  2003. Walmart is clocking 1 million transactions per hour and Facebook has 40 billion photos !!! This image would give you an idea on the amount of Big Data generated during the 2012 Olympics. 
  3. Different types of data: We no longer have the liberty to assume that all valuable data would be available to us in a structured format - well defined using some schema. There is going to be a huge volume of unstructured data that needs to be exploited. For e.g. emails, application logs, web click stream analysis, messaging events, etc. 
These 3 challenges of data are also popularly called as the 3 Vs of Big Data (volume of data, velocity of data and variety of data). To tackle these challenges, Martin urges us to focus on the following 3 aspects:
  1. Extraction of Data: Data is going to come from a lot of structured and unstructured sources. We need new skills to harvest and collate data from multiple sources. The fundamental challenge would be to understand how valuable some data could be? How do we discover such sources of data?
  2. Interpretation of Data: Ability to separate the wheat from the chaff. What data is pure noise? How to differentiate between signal and noise? How to avoid probabilistic illusions?
  3. Visualization of Data: Usage of modern visualization techniques that would make the data more interactive and dynamic. Visualization can be simple with good usability in mind. 
As this blog entry puts it in words - "Data is the new oil ! Data is just like crude. It’s valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc to create a valuable entity that drives profitable activity; so must data be broken down, analyzed for it to have value."

NoSQL databases are also gaining popularity. Application architects would need to consider polyglot persistence for datasets having different characteristics. For e.g. columnar data stores (aggregate oriented), graph databases, key-value stores, etc.