Wednesday, February 12, 2014

Ruminating on HBase datastore

While explaining the concept of HBase to my colleagues, I have observed that folks that do NOT have the baggage of traditional knowledge on RDBMS/DW are able to understand the fundamental concepts of HBase much faster than others. Application developers have been using data structures (collections) such as HashTable, HashMap for decades and they are better able to understand HBase concepts.

Part of the confusion is due to the terminology used in describing HBase. It is categorized as a column-oriented NOSQL store and is different from a row oriented traditional RDBMS. There are tons of articles that describe HBase as a data structure containing rows, column families, columns, cells, etc. In reality, HBase is nothing but a giant multidimensional sorted HashMap that is distributed across nodes. A HashMap consists of a set of key-value pairs. A multidimensional HashMap is one that has 'values' as other HashMaps.

Each row in HBase is actually a key/value map. This map can have any number of keys (known as columns), each of which has a value. This 'value' can again be a HashMap which has version history with timestamps. 

Each row in HBase can also store multiple key/value maps. In such cases, each key/value map is called a 'column family'. Each column-family is typically stored on different physical files or disks. This concept was introduced to support use-cases where you can have two sets of data for the same concept that are not generally accessed separately. 

The following article gives a good overview of the concepts we just discussed with JSON examples.
http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable

Once we have a solid understanding of HBase concepts, it is useful to look at some common schema examples or use-cases where HBase datastore would be valuable. Link: http://hbase.apache.org/book/schema.casestudies.html

Looking at the use-cases in the above link, it is easy to understand that a lot of thought needs to go into designing a HBase schema. There would be multiple options/ways to design a HBase schema based on the primary use case of how data is going to be accessed. In other words, the design of your HBase schema would be dependent on your use-case. This is in total contrast to traditional data warehouses, where the applications accessing the data warehouse are free to define their own use-cases and the warehouse would support almost all use-cases within reasonable limits.

The following articles throw more light on the design constraints that should be considered while using HBase.
http://www.chrispwood.net/2013/09/hbase-schema-design-fundamentals.html
http://ianvarley.com/coding/HBaseSchema_HBaseCon2012.pdf

For folks who do not have a programming background, but want to understand HBase in terms of table concepts, then the following YouTube video would be useful.
http://www.youtube.com/watch?v=IumVWII3fRQ