Wednesday, February 12, 2014

Ruminating on Column Oriented Data Stores

There is some confusion in the market regarding column oriented databases (aka column-based). Due to the popularity of NoSQL stores such as HBase and Cassandra, many folks assume that that column oriented stores are only for unstructured big data. The fact is that column-oriented tables are available both in SQL (RDBMS) and NoSQL stores.

Let's look at SQL RDBMS first to understand the difference between row-based and column-based tables.
First and foremost, it is important to understand that whether the RDBMS is column or row-oriented is a physical storage implementation detail of the database. There is NO difference in the way we query against these databases using SQL or MDX (for multi-dimension querying on cubes).

In a row-based table, all the attributes (fields of a row) are stored side-by-side in a linear fashion. In a column-based table, all the the data in a particular column are stored side-by-side. An excellent overview of this concept with illustrative examples is given on Wikipedia. The following illustration should clear the concept easily.


For any query, the most expensive operations are hard disk seeks. As you can see from the above diagram, based on the type of table storage, certain queries would run faster. Column-oriented tables are more efficient when aggregates (or other calculations) need to be computed over many rows but only for a notably smaller subset of all columns of data. Hence column-oriented databases are popular in OLAP databases, where as row-oriented databases are popular in OLTP databases.
What's interesting to note is that the with the advent of in-memory databases, disk seek time is no longer a constraint. But large DWs contain peta-bytes of data and all data cannot be kept in memory.

A few years back, the world was divided between column-based databases and row-based databases. But slowly, almost all database vendors are giving the flexibility to store data in either row-based or column-based tables in the same database. For e.g. SAP HANA and Oracle 12c. Also all these databases are adding in-memory capabilities, thus boosting the performance of databases many-fold.

Now in case of NOSQL stores, the concept of column-oriented is more-or-less the same. But the difference is that stores such as HBase allow us to have different number of columns for each row. More information on the internal schema of HBase can be found here