Monday, February 11, 2013

Ruminating on Visualization Techniques

The following link points to a good illustration of the various kinds of visualization techniques one can use to communicate ideas or clarify the business value behind data.

http://www.visual-literacy.org/periodic_table/periodic_table.html

We are also experimenting with a cool new JS library called D3.js. Some pretty good visualization samples are available here.

The library can be used for basic charting as well as for building impressive custom visualizations. We found this tutorial invaluable in understanding the basics of D3.
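
For a taste of what D3 code looks like, here is a minimal sketch that renders a crude bar chart from a plain array. It assumes a page that loads the bundled script along with the d3 npm package; the data values and the hand-picked scale factor are purely illustrative:

```typescript
import * as d3 from "d3"; // assumes the d3 package is installed and bundled

const data = [4, 8, 15, 16, 23, 42]; // illustrative values
const barHeight = 20;

// Create an SVG container sized to fit one bar per datum.
const svg = d3.select("body").append("svg")
    .attr("width", 420)
    .attr("height", barHeight * data.length);

// The classic D3 data join: one <rect> per array element.
svg.selectAll("rect")
    .data(data)
    .enter().append("rect")
    .attr("y", (_d: number, i: number) => i * barHeight)
    .attr("width", (d: number) => d * 10) // scale factor picked by hand for brevity
    .attr("height", barHeight - 1)
    .attr("fill", "steelblue");
```

The selectAll/data/enter chain is the heart of D3: it binds an array to DOM elements and creates one element per datum.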

Anscombe's Quartet

We often use statistical properties such as mean, variance, and standard deviation when measuring the performance of applications/services. Recently a friend of mine pointed out that relying only on calculated stats can be quite misleading, and pointed me to the following article on Wikipedia.

http://en.wikipedia.org/wiki/Anscombe's_quartet

Anscombe's quartet comprises four datasets that have nearly identical simple statistical properties, yet appear very different when graphed.
Just by looking at the data sets, it's impossible to predict that the graphs would be so different. Only when we plot the data points on a graph can we see how the data behaves. Another testament to the power of data visualization!
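
To see the quartet's trickery numerically, here is a small sketch that computes the headline statistics for all four datasets; the data values are copied from the Wikipedia article:

```typescript
// Anscombe's quartet, as listed in the Wikipedia article.
// Datasets I-III share the same x values.
const x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5];
const quartet: Array<{ x: number[]; y: number[] }> = [
  { x: x123, y: [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68] },
  { x: x123, y: [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74] },
  { x: x123, y: [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73] },
  { x: [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
    y: [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89] },
];

const mean = (v: number[]) => v.reduce((a, b) => a + b, 0) / v.length;

// Sample variance (n - 1 denominator).
const variance = (v: number[]) => {
  const m = mean(v);
  return v.reduce((a, b) => a + (b - m) ** 2, 0) / (v.length - 1);
};

// Pearson correlation coefficient.
const correlation = (x: number[], y: number[]) => {
  const mx = mean(x), my = mean(y);
  let sxy = 0, sxx = 0, syy = 0;
  for (let i = 0; i < x.length; i++) {
    sxy += (x[i] - mx) * (y[i] - my);
    sxx += (x[i] - mx) ** 2;
    syy += (y[i] - my) ** 2;
  }
  return sxy / Math.sqrt(sxx * syy);
};

// All four datasets print mean(x)=9, var(x)=11, mean(y)≈7.50, corr≈0.816.
for (const { x, y } of quartet) {
  console.log(mean(x), variance(x), mean(y).toFixed(2), correlation(x, y).toFixed(3));
}
```

All four rows print essentially the same numbers, yet the scatter plots could hardly look more different.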

Useful Command Line Tools for Windows 7

Jotting down some Win 7 commands that I often use for quick access to information. You can type these commands directly into the 'Run' window (Win+R).

  • msconfig: Shows the programs that are configured to start during boot; disable the ones you don't need.
  • msinfo32: Quick summary of your system information; gives detailed info on hardware resources, OS details, system properties, etc.
  • control: Quick access to the Control Panel.
  • eventvwr: Quick access to the Event Viewer.
  • perfmon: Useful tool to monitor the performance of your system using performance counters.
  • resmon: Great tool to check resource utilization across CPU, memory, and disk I/O.
  • taskmgr: Quick access to Task Manager.
  • cmd: Opens the command prompt.
  • inetcpl.cpl: Opens the Internet settings for proxy, security, etc.
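
If you ever need to launch one of these tools from a script (say, as part of homegrown support tooling), a minimal Node.js sketch might look like this; resmon is picked arbitrarily, and this obviously assumes Node running on a Windows box:

```typescript
import { execFile } from "child_process";

// Launch Resource Monitor just as the Run dialog would.
// Any of the commands above (perfmon, msinfo32, ...) works the same way.
execFile("resmon.exe", (err) => {
  if (err) {
    console.error("could not start resmon:", err.message);
  }
});
```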

Ruminating on Big Data

Came across an interesting infodeck on Big Data by Martin Fowler. There is a lot of hype around Big Data, and dozens of pundits are defining it in their own terms :) IMHO, right now we are at the "peak of inflated expectations" and the "height of media infatuation" in the hype cycle.

But I agree with Martin that there is considerable fire behind the smoke. Once the hype dies down, folks will realize that we don't need another fancy term; we actually need to rethink the basic principles of data management.

There are three fundamental changes that are driving us to look beyond our current understanding of data management.
  1. Volume of data: Today the volume of data is so huge that the traditional technique of creating a single centralized database is no longer feasible. Grid-based distributed databases are going to become more and more common.
  2. Speed at which data is growing: Due to Web 2.0, the explosion of electronic commerce, social media, etc., the rate at which data (mostly user-generated content) is growing is unprecedented in the history of mankind. According to Eric Schmidt (then Google CEO), every two days we now create as much information as we did from the dawn of civilization up until 2003. Walmart is clocking 1 million transactions per hour, and Facebook has 40 billion photos! This image gives you an idea of the amount of Big Data generated during the 2012 Olympics.
  3. Different types of data: We no longer have the liberty to assume that all valuable data will be available to us in a structured format, well defined by some schema. There is a huge volume of unstructured data that needs to be exploited: emails, application logs, web clickstream data, messaging events, etc. (a small parsing sketch follows this list).
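
As a tiny illustration of point 3, here is a sketch that pulls structure out of free-form application log lines; the log format and field names are made up for the example:

```typescript
// Hypothetical log format: "2013-02-11 10:32:01 ERROR PaymentService timeout after 30s"
const LOG_PATTERN = /^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (\w+) (.*)$/;

interface LogRecord {
  timestamp: string;
  level: string;
  service: string;
  message: string;
}

function parseLine(line: string): LogRecord | null {
  const m = LOG_PATTERN.exec(line);
  if (!m) return null; // unstructured lines that do not match are skipped
  return { timestamp: m[1], level: m[2], service: m[3], message: m[4] };
}

// Aggregate error counts per service from a raw log dump.
function errorCounts(lines: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of lines) {
    const rec = parseLine(line);
    if (rec && rec.level === "ERROR") {
      counts.set(rec.service, (counts.get(rec.service) ?? 0) + 1);
    }
  }
  return counts;
}

const sample = [
  "2013-02-11 10:32:01 ERROR PaymentService timeout after 30s",
  "free-form line that does not match the pattern",
  "2013-02-11 10:32:05 ERROR PaymentService connection refused",
];
console.log(errorCounts(sample)); // Map { 'PaymentService' => 2 }
```
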
These three challenges are popularly called the 3 Vs of Big Data: volume, velocity, and variety. To tackle them, Martin urges us to focus on the following three aspects:
  1. Extraction of data: Data is going to come from a lot of structured and unstructured sources. We need new skills to harvest and collate data from multiple sources. The fundamental challenge is understanding how valuable a given source of data could be, and how we discover such sources in the first place.
  2. Interpretation of data: The ability to separate the wheat from the chaff. What data is pure noise? How do we differentiate between signal and noise (see the sketch after this list)? How do we avoid probabilistic illusions?
  3. Visualization of data: Use of modern visualization techniques that make the data more interactive and dynamic. Visualizations can be kept simple, with good usability in mind.
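
As a toy illustration of the signal-versus-noise point above, a simple moving average is often the first cut at smoothing a jittery metric series; the window size here is an arbitrary choice:

```typescript
// Naive moving average: separates the trend (signal) from
// point-to-point jitter (noise) in a metric series.
function movingAverage(series: number[], window: number): number[] {
  const smoothed: number[] = [];
  for (let i = 0; i + window <= series.length; i++) {
    let sum = 0;
    for (let j = i; j < i + window; j++) sum += series[j];
    smoothed.push(sum / window);
  }
  return smoothed;
}

// A noisy ramp: the raw series zigzags, the smoothed output
// exposes the steady upward trend.
console.log(movingAverage([1, 9, 2, 10, 3, 11, 4, 12], 2));
// [5, 5.5, 6, 6.5, 7, 7.5, 8]
```
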
As this blog entry puts it: "Data is the new oil! Data is just like crude. It’s valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc. to create a valuable entity that drives profitable activity; so must data be broken down, analyzed for it to have value."

NoSQL databases are also gaining popularity. Application architects will need to consider polyglot persistence for datasets with different characteristics: columnar data stores (aggregate-oriented), graph databases, key-value stores, etc.
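
To get a feel for the aggregate-oriented, key-value style of access, here is a toy in-memory sketch. A real store such as Redis or Riak adds persistence, replication, and much more, and the CustomerAggregate shape is invented for the example:

```typescript
// A whole "customer aggregate" is stored and retrieved as one value,
// rather than being normalized across several relational tables.
interface CustomerAggregate {
  id: string;
  name: string;
  orders: { orderId: string; total: number }[];
}

class KeyValueStore<V> {
  private data = new Map<string, V>();

  put(key: string, value: V): void {
    this.data.set(key, value);
  }

  get(key: string): V | undefined {
    return this.data.get(key); // lookups are by key only; no ad-hoc queries
  }
}

const store = new KeyValueStore<CustomerAggregate>();
store.put("customer:42", {
  id: "42",
  name: "Alice",
  orders: [{ orderId: "o-1", total: 99.5 }],
});
console.log(store.get("customer:42")?.name); // "Alice"
```

The trade-off is clear from the sketch: reading or writing a whole aggregate by key is trivially fast and easy to distribute across a grid, but you give up ad-hoc relational queries across aggregates.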