
Monday, May 30, 2016

Ruminating on IoT datastores

The most popular data-store choices for storing high volumes of IoT sensor data are NoSQL time-series databases. 

The following link contains a good list of NoSQL time-series databases that can be used in an IoT project. We have worked with both OpenTSDB and KairosDB and found them to be enterprise-grade. 
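For illustration, here is a minimal sketch (plain Java, no client library) of pushing a single sensor reading into OpenTSDB over its HTTP /api/put endpoint. It assumes a local OpenTSDB instance on the default port 4242, and the metric name and tags are made up for the example; KairosDB exposes a very similar REST API.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Minimal sketch: push one temperature reading for a device into OpenTSDB.
// Assumes OpenTSDB is running locally on its default HTTP port (4242).
public class SensorDataWriter {

    public static void main(String[] args) throws Exception {
        long epochSeconds = System.currentTimeMillis() / 1000;
        String json = "{"
                + "\"metric\":\"device.temperature\","
                + "\"timestamp\":" + epochSeconds + ","
                + "\"value\":27.4,"
                + "\"tags\":{\"deviceId\":\"sensor-42\",\"site\":\"plant-1\"}"
                + "}";

        HttpURLConnection conn = (HttpURLConnection) new URL("http://localhost:4242/api/put").openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream os = conn.getOutputStream()) {
            os.write(json.getBytes(StandardCharsets.UTF_8));
        }
        // OpenTSDB returns 204 No Content on a successful write.
        System.out.println("HTTP status: " + conn.getResponseCode());
        conn.disconnect();
    }
}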



Wednesday, August 05, 2015

Ruminating on Data Lake

Anyone trying to understand what a Data Lake is should read Martin Fowler's wonderful article on the topic - http://martinfowler.com/bliki/DataLake.html

Jotting down important points from the article -

  1. Traditional data warehouses (data marts) have a fixed schema - it could be a star schema or a snowflake schema. Having a fixed schema imposes many restrictions on data analysis. A Data Lake is essentially schema-less. 
  2. Data warehouses also typically cleanse the incoming data and improve its quality, and they aggregate data for faster reporting. In contrast, a Data Lake stores raw data from the source systems. It is up to the data scientist to extract the data and make sense of it. 
  3. We still need Data Marts - because the data in a data lake is raw, you need a lot of skill to make any sense of it. Relatively few people work in the data lake; as they uncover generally useful views of the data, they can create a number of data marts, each with a specific model for a single bounded context. A larger number of downstream users can then treat these lake-shore marts as an authoritative source for that context.

Friday, June 12, 2015

Implementing sliding window aggregations in Apache Storm

My team was working on implementing CEP (Complex Event Processing) capabilities using Apache Storm. We evaluated multiple options for doing so - one option was using a lightweight in-process CEP engine like Esper within a Storm Bolt.

Another option was to manually implement CEP-like aggregations (over a sliding window) in plain Java code. The following links show how to do so.

http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/

Rolling Count Bolt on Github
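To give a flavour of the manual approach, below is a stripped-down sketch of a sliding-window count bolt in the spirit of the rolling count bolt above - it is not taken from that project. It assumes the Storm 0.9.x API (backtype.storm packages), a tuple field called "word", and a 60-second window that slides every 10 seconds via tick tuples; all of these are illustrative choices.

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import backtype.storm.Config;
import backtype.storm.Constants;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Sliding-window counter: buckets incoming keys per time slot and, on every
// tick tuple, emits the total count per key over the last WINDOW_SECS seconds.
public class SlidingWindowCountBolt extends BaseRichBolt {

    private static final int WINDOW_SECS = 60;   // length of the sliding window (assumed)
    private static final int EMIT_SECS = 10;     // tick frequency = slide interval (assumed)

    // bucketStartMillis -> (key -> count within that bucket)
    private final Map<Long, Map<String, Long>> buckets = new HashMap<Long, Map<String, Long>>();
    private OutputCollector collector;

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        if (isTickTuple(tuple)) {
            evictExpiredBuckets();
            emitWindowCounts();
        } else {
            String key = tuple.getStringByField("word");   // field name is an assumption
            long bucket = (System.currentTimeMillis() / (EMIT_SECS * 1000L)) * (EMIT_SECS * 1000L);
            Map<String, Long> slot = buckets.get(bucket);
            if (slot == null) {
                slot = new HashMap<String, Long>();
                buckets.put(bucket, slot);
            }
            Long current = slot.get(key);
            slot.put(key, current == null ? 1L : current + 1L);
        }
        collector.ack(tuple);
    }

    private void evictExpiredBuckets() {
        long cutoff = System.currentTimeMillis() - WINDOW_SECS * 1000L;
        Iterator<Long> it = buckets.keySet().iterator();
        while (it.hasNext()) {
            if (it.next() < cutoff) {
                it.remove();
            }
        }
    }

    private void emitWindowCounts() {
        Map<String, Long> totals = new HashMap<String, Long>();
        for (Map<String, Long> slot : buckets.values()) {
            for (Map.Entry<String, Long> e : slot.entrySet()) {
                Long current = totals.get(e.getKey());
                totals.put(e.getKey(), current == null ? e.getValue() : current + e.getValue());
            }
        }
        for (Map.Entry<String, Long> e : totals.entrySet()) {
            collector.emit(new Values(e.getKey(), e.getValue()));
        }
    }

    private boolean isTickTuple(Tuple tuple) {
        return Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent())
                && Constants.SYSTEM_TICK_STREAM_ID.equals(tuple.getSourceStreamId());
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        Map<String, Object> conf = new HashMap<String, Object>();
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, EMIT_SECS);
        return conf;
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "windowCount"));
    }
}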

While the above code would satisfy certain scenarios, it would not provide the flexibility of a CEP engine. We need to understand that CEP engines (like TIBCO BE, Esper and StreamInsight) are fundamentally different from Apache Storm, which is more of a highly distributed stream-computing platform.

A CEP engine provides SQL-like declarative queries and OOTB high-level operators such as time windows, temporal patterns, etc. This brings down the complexity of writing temporal queries and aggregations. CEP engines can also detect patterns across events. But most CEP engines do not support a distributed architecture.

Hence it makes sense to combine CEP with Apache Storm - for example, by embedding Esper within a Storm Bolt. The following links serve as good references -

http://stackoverflow.com/questions/29025886/esper-ha-v-s-esper-storm
http://stackoverflow.com/questions/9164785/how-to-scale-out-with-esper/9776881#9776881
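As a rough sketch of what "Esper inside a bolt" looks like, the snippet below (assuming the Esper 5.x client API, with made-up event and field names) registers a declarative sliding-window query. A Storm bolt would hold the engine as a member, call sendEvent() from its execute() method, and emit downstream from the update listener.

import java.util.HashMap;
import java.util.Map;

import com.espertech.esper.client.Configuration;
import com.espertech.esper.client.EPServiceProvider;
import com.espertech.esper.client.EPServiceProviderManager;
import com.espertech.esper.client.EPStatement;
import com.espertech.esper.client.EventBean;
import com.espertech.esper.client.UpdateListener;

// Sketch of an in-process Esper engine that a Storm bolt could hold as a member.
public class EsperWindowExample {

    public static void main(String[] args) {
        Configuration config = new Configuration();
        // Declare the event type as a simple map of properties (names are hypothetical).
        Map<String, Object> eventType = new HashMap<String, Object>();
        eventType.put("deviceId", String.class);
        eventType.put("temperature", Double.class);
        config.addEventType("TemperatureReading", eventType);

        EPServiceProvider epService = EPServiceProviderManager.getDefaultProvider(config);

        // Declarative sliding-window query: average temperature per device over the last 60 seconds.
        EPStatement stmt = epService.getEPAdministrator().createEPL(
                "select deviceId, avg(temperature) as avgTemp "
                + "from TemperatureReading.win:time(60 sec) group by deviceId");

        stmt.addListener(new UpdateListener() {
            public void update(EventBean[] newEvents, EventBean[] oldEvents) {
                if (newEvents != null) {
                    for (EventBean event : newEvents) {
                        // In a real bolt this is where collector.emit(...) would be called.
                        System.out.println(event.get("deviceId") + " -> " + event.get("avgTemp"));
                    }
                }
            }
        });

        // A bolt's execute() would do this for every tuple it receives.
        Map<String, Object> reading = new HashMap<String, Object>();
        reading.put("deviceId", "sensor-42");
        reading.put("temperature", 27.4);
        epService.getEPRuntime().sendEvent(reading, "TemperatureReading");
    }
}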

Monday, March 17, 2014

Ruminating on Distributed Logging

Of late, we have been experimenting with different frameworks available for distributed logging. I recollect that a decade back, I had written my own rudimentary distributed logging solution :)

To better appreciate the benefits of distributed log collection, it's important to visualize logs as streams and not files, as explained in this article.
The most promising frameworks we have experimented with are:

  1. Logstash: Logstash combined with ElasticSearch and Kibana gives us a cool OOTB solution. Also, Logstash runs on the Java platform and was very easy to set up and get running. 
  2. Fluentd: Another cool framework for distributed logging. A good comparison between Logstash and Fluentd is available here. 
  3. Splunk: The most popular commercial tool for log management and data analytics. 
  4. Graylog: A new kid on the block that uses ElasticSearch. Worth keeping a watch on. 
  5. Flume: Flume's main goal is to deliver data from applications to Apache Hadoop's HDFS. It has a simple and flexible architecture based on streaming data flows.
  6. Scribe: Scribe is written in C++ and uses Thrift for the protocol encoding. This project was released as open source by Facebook. 

Wednesday, February 12, 2014

Ruminating on Column Oriented Data Stores

There is some confusion in the market regarding column-oriented (aka column-based) databases. Due to the popularity of NoSQL stores such as HBase and Cassandra, many folks assume that column-oriented stores are only for unstructured big data. The fact is that column-oriented tables are available in both SQL (RDBMS) and NoSQL stores.

Let's look at SQL RDBMS first to understand the difference between row-based and column-based tables.
First and foremost, it is important to understand that whether an RDBMS is column- or row-oriented is a physical storage implementation detail of the database. There is NO difference in the way we query these databases using SQL or MDX (for multi-dimensional querying on cubes).

In a row-based table, all the attributes (fields) of a row are stored side by side in a linear fashion. In a column-based table, all the data in a particular column is stored side by side. An excellent overview of this concept, with illustrative examples, is given on Wikipedia.


For any query, the most expensive operations are hard-disk seeks, so depending on the storage layout certain queries will run faster. Column-oriented tables are more efficient when aggregates (or other calculations) need to be computed over many rows but only for a notably smaller subset of all the columns. Hence column-oriented databases are popular for OLAP workloads, whereas row-oriented databases are popular for OLTP workloads.
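A toy Java example makes the difference concrete - the "tables" and numbers below are of course made up. Summing one column over the column layout touches only one contiguous block, while the row layout forces a walk past every field of every record.

public class StorageLayoutExample {
    public static void main(String[] args) {
        // Row-oriented layout: all fields of a record sit next to each other.
        Object[][] rows = {
            {1, "Alice", 7000.0},
            {2, "Bob",   8500.0},
            {3, "Carol", 9100.0}
        };
        double totalFromRows = 0;
        for (Object[] row : rows) {
            totalFromRows += (Double) row[2];   // we walk past every field of every record
        }

        // Column-oriented layout: all values of one column sit next to each other.
        double[] salaryColumn = {7000.0, 8500.0, 9100.0};
        double totalFromColumn = 0;
        for (double salary : salaryColumn) {    // only the one block we care about is scanned
            totalFromColumn += salary;
        }

        System.out.println(totalFromRows + " == " + totalFromColumn);
    }
}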
What's interesting to note is that with the advent of in-memory databases, disk seek time is no longer a constraint. But large data warehouses contain petabytes of data, and all of that data cannot be kept in memory.

A few years back, the world was divided between column-based and row-based databases. But slowly almost all database vendors are providing the flexibility to store data in either row-based or column-based tables within the same database - for example, SAP HANA and Oracle 12c. All of these databases are also adding in-memory capabilities, boosting performance many-fold.

In the case of NoSQL stores, the concept of column orientation is more or less the same. The difference is that stores such as HBase allow each row to have a different set of columns. More information on the internal schema of HBase can be found here.

Ruminating on BSON vs JSON

MongoDB uses BSON as the default format to store documents in collections. BSON is a binary-encoded serialization of JSON-like documents. But what are the advantages of BSON over the ubiquitous JSON format?

In terms of space (memory footprint), BSON is generally more efficient than JSON, but not always. For example, integers are stored as 32-bit (or 64-bit) integers, so they don't need to be parsed to and from text - but this actually uses more space than JSON for small integers.

The primary goal of BSON is to enable very fast traversability, i.e. faster processing. BSON adds extra information to documents, e.g. length prefixes, that make them easy and fast to traverse. Because of this, a BSON parser can move through documents at blinding speed.
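To make the length prefixes concrete, here is the canonical example from the BSON spec (bsonspec.org) - the document {"hello": "world"} encoded as 22 bytes - written out as a Java byte array.

public class BsonLayoutExample {
    public static void main(String[] args) {
        // BSON encoding of {"hello": "world"} -- the example from bsonspec.org.
        byte[] bson = {
            0x16, 0x00, 0x00, 0x00,            // int32 (little-endian): total document length = 22 bytes
            0x02,                              // element type 0x02 = UTF-8 string
            'h', 'e', 'l', 'l', 'o', 0x00,     // element name "hello", null-terminated
            0x06, 0x00, 0x00, 0x00,            // int32: value length = 6 (includes trailing null)
            'w', 'o', 'r', 'l', 'd', 0x00,     // string value "world", null-terminated
            0x00                               // end-of-document marker
        };
        // A parser can read the first four bytes and skip the whole document
        // (or an embedded one) without scanning the remaining bytes.
        System.out.println("Document length: " + bson[0]);
    }
}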

Thus BSON sacrifices space efficiency for processing efficiency (a.k.a traversability). This is a fair trade-off for MongoDB where speed is the primary concern. 

Ruminating on HBase datastore

While explaining the concept of HBase to my colleagues, I have observed that folks who do NOT carry the baggage of traditional RDBMS/DW knowledge understand the fundamental concepts of HBase much faster than others. Application developers have been using data structures (collections) such as Hashtable and HashMap for decades, so they are better able to grasp HBase concepts.

Part of the confusion is due to the terminology used to describe HBase. It is categorized as a column-oriented NoSQL store and is different from a row-oriented traditional RDBMS. There are tons of articles that describe HBase as a data structure containing rows, column families, columns, cells, etc. In reality, HBase is nothing but a giant multidimensional sorted HashMap that is distributed across nodes. A HashMap consists of a set of key-value pairs; a multidimensional HashMap is one whose 'values' are themselves HashMaps.

Each row in HBase is actually a key/value map. This map can have any number of keys (known as columns), each of which has a value. This 'value' can itself be a HashMap that holds the version history with timestamps. 

Each row in HBase can also store multiple key/value maps. In such cases, each key/value map is called a 'column family'. Each column family is typically stored in separate physical files or on separate disks. This concept was introduced to support use-cases where you have two sets of data for the same row key that are not generally accessed together. 
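A small sketch using the HBase Java client (assuming the HBase 1.x API and a made-up, pre-created table with column families "d" and "m") shows how two rows in the same table can carry completely different columns.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Two rows in the same table with completely different column qualifiers --
// perfectly legal in HBase, since each row is just a sorted map of key/value pairs.
public class HBasePutExample {

    public static void main(String[] args) throws Exception {
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = connection.getTable(TableName.valueOf("device_readings"))) {

            byte[] cfData = Bytes.toBytes("d");    // column family "d" (data)
            byte[] cfMeta = Bytes.toBytes("m");    // column family "m" (metadata), stored separately on disk

            Put row1 = new Put(Bytes.toBytes("sensor-42"));
            row1.addColumn(cfData, Bytes.toBytes("temperature"), Bytes.toBytes("27.4"));
            row1.addColumn(cfData, Bytes.toBytes("humidity"), Bytes.toBytes("61"));
            row1.addColumn(cfMeta, Bytes.toBytes("firmware"), Bytes.toBytes("2.1.0"));

            Put row2 = new Put(Bytes.toBytes("sensor-99"));
            row2.addColumn(cfData, Bytes.toBytes("vibration"), Bytes.toBytes("0.003"));  // a column row1 never has

            table.put(row1);
            table.put(row2);
        }
    }
}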

The following article gives a good overview of the concepts we just discussed, using JSON examples.
http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable

Once we have a solid understanding of HBase concepts, it is useful to look at some common schema examples and use-cases where an HBase datastore would be valuable. Link: http://hbase.apache.org/book/schema.casestudies.html

Looking at the use-cases in the above link, it is easy to see that a lot of thought needs to go into designing an HBase schema. There are multiple ways to design an HBase schema, depending on how the data is primarily going to be accessed; in other words, the design of your HBase schema is driven by your use-case. This is in total contrast to traditional data warehouses, where the applications accessing the warehouse are free to define their own use-cases and the warehouse supports almost all of them within reasonable limits.
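As a tiny illustration of use-case-driven design, a common pattern for a "latest readings per device first" access pattern is a composite row key of the device id plus a reversed timestamp, so that a scan starting at the device prefix returns the newest rows first. This is a hypothetical sketch, not a universal recommendation.

import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical row-key design: deviceId + (Long.MAX_VALUE - timestamp) sorts
// the most recent readings of a device ahead of the older ones.
public class RowKeyDesign {
    public static byte[] rowKey(String deviceId, long epochMillis) {
        byte[] devicePart = Bytes.toBytes(deviceId);
        byte[] reversedTimestamp = Bytes.toBytes(Long.MAX_VALUE - epochMillis);
        return Bytes.add(devicePart, reversedTimestamp);
    }
}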

The following articles shed more light on the design constraints that should be considered while using HBase.
http://www.chrispwood.net/2013/09/hbase-schema-design-fundamentals.html
http://ianvarley.com/coding/HBaseSchema_HBaseCon2012.pdf

For folks who do not have a programming background but want to understand HBase in terms of table concepts, the following YouTube video would be useful.
http://www.youtube.com/watch?v=IumVWII3fRQ

Thursday, September 26, 2013

Ruminating on MongoDB concurrency

In my previous post, I had discussed the transaction support in MongoDB. Since MongoDB implements a readers-writer lock at the database level, I was a bit concerned that the writer lock for a long-running query might block all read operations for the duration of that query.

For example, if we fire a MongoDB query that updates 10,000 documents and takes 5 minutes, would all reads be blocked for those 5 minutes until all the records are updated? If yes, that would be disastrous.

Fortunately this does not happen, due to lock yielding as stated on the MongoDB site.
Snippet from the site:
"Write operations that affect multiple documents (i.e. update() with the multi parameter,) will yield periodically to allow read operations during these long write operations. Similarly, long running read locks will yield periodically to ensure that write operations have the opportunity to complete."

Read locks are shared, but read locks block write locks from being acquired, and write locks block both other writes and reads. However, MongoDB operations yield periodically so that other threads waiting for locks do not starve. An interesting blog post that shows stats on the performance of MongoDB locks is available here.
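For reference, this is the kind of multi-document write the yielding behaviour applies to - sketched here with the current MongoDB Java driver (which post-dates this post) and made-up database, collection and field names.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import com.mongodb.client.result.UpdateResult;
import org.bson.Document;

// A multi-document update: it may touch thousands of documents, but the server
// yields its lock periodically so readers are not starved for the whole duration.
public class MultiUpdateExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> orders = client.getDatabase("shop").getCollection("orders");
            UpdateResult result = orders.updateMany(
                    Filters.eq("status", "PENDING"),
                    Updates.set("status", "ARCHIVED"));
            System.out.println("Documents updated: " + result.getModifiedCount());
        }
    }
}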

Thursday, September 19, 2013

Transactions in MongoDB

In my previous blog post, we went through some of the advantages of MongoDB in terms of schema flexibility and performance. The MetLife case study showed how we can quickly create a customer hub using MongoDB that supports flexible, dynamic schemas.

Another added advantage of using MongoDB is that you don't have to worry about ORM tools, as there is no object-relational impedance mismatch. You also don't have to worry about creating an application cache, as MongoDB by default uses all available memory for its working set.

But what about transactions? Any OLTP application would need full support for ACID to ensure reliability and consistency of data. The following articles shed good light on the transaction support in MongoDB.

http://docs.mongodb.org/manual/faq/fundamentals/
http://css.dzone.com/articles/how-acid-mongodb
http://blog.scrapinghub.com/2013/05/13/mongo-bad-for-scraped-data/

MongoDB only supports atomicity at the document level. It's important to remember that we can have nested documents, and MongoDB supports atomicity across these nested documents because they are part of a single top-level document. But if we need multi-document transaction support, then MongoDB is not a good fit.
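A quick sketch (again with the modern Java driver and hypothetical names) of what document-level atomicity buys you: both modifications below hit the same document, one of them a nested sub-document, so they are applied atomically. Moving money between two different account documents would not get that guarantee.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import org.bson.Document;

// Both fields below live in the same document, so the update is atomic.
public class DocumentAtomicityExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> accounts = client.getDatabase("bank").getCollection("accounts");
            accounts.updateOne(
                    Filters.eq("accountId", "ACC-1001"),
                    Updates.combine(
                            Updates.inc("balance", -500),
                            Updates.push("history.transactions",   // nested array inside the same document
                                    new Document("type", "DEBIT").append("amount", 500))));
        }
    }
}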

Also, if your application needs to 'join' objects frequently - for example, loading reference data (static data) from master tables along with the transaction data - then MongoDB is not a good fit in that respect either. 
MongoDB locks (readers-writer locks) are at the database level, i.e. the entire database gets locked during a write operation. This can result in lock contention when you have a large number of write operations.

Looking at the pros and cons of MongoDB, IMHO it is best suited for read-heavy applications - for example, a consolidated high-performance read-only customer hub, a data store for content management systems, product catalogs in e-commerce systems, etc.

Wednesday, September 04, 2013

NoSQL for Customer Hub MDM

The following article on InformationWeek is an interesting read on the use of the MongoDB NoSQL database for building a customer MDM solution.

http://www.informationweek.com/software/information-management/metlife-uses-nosql-for-customer-service/240154741

MongoDB, being a document-oriented NoSQL database, has its core strength in maintaining flexible schemas and storing data as JSON or BSON objects. Let's look at the pros and cons of using MongoDB as an MDM solution.
  1. One of the fundamental challenges in creating a customer hub is the aggregation of disparate data from a variety of sources. For example, a customer could have bought a number of products from an insurance firm. Using a traditional RDBMS would entail the complexity of joining table records and fulfilling all the referential constraints on the data. Also, each insurance product may have different fields and dimensions - should we create a table for each product type? In MongoDB, you can store all the policies of a customer in one JSON object (see the sketch after this list). You can store different types of policy for each customer with full flexibility and maintain a natural parent-child hierarchy of relationships. 

  2. Another problem that insurance firms face is that of legacy policy records. Certain insurance products such as annuities have a long life period, but regulations and business needs change over the years, and your old policy records may not have all the fields that are captured in new policy records. How do you handle such cases? A strict schema would not help, and hence a solution like MongoDB offers the necessary flexibility to store sparse data. 

  3. MongoDB also has an edge in terms of low TCO for scalability and performance. Its auto-sharding capabilities enable massive horizontal scalability. It also supports memory-mapped files out of the box, which is a tremendous help given the prominence of 64-bit computing and the abundance of available RAM. 
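Here is a hedged sketch (MongoDB Java driver, invented database, collection and field names) of what such a customer document could look like, with two policies of different product types stored inside a single document.

import java.util.Arrays;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

// One customer document holding two policies of different product types, each
// with its own set of fields -- no schema migration or extra tables required.
public class CustomerHubExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> customers = client.getDatabase("mdm").getCollection("customers");
            customers.insertOne(new Document("customerId", "CUST-1001")
                    .append("name", "Jane Doe")
                    .append("policies", Arrays.asList(
                            new Document("type", "TERM_LIFE")
                                    .append("sumAssured", 500000)
                                    .append("term", 20),
                            new Document("type", "ANNUITY")
                                    .append("payoutFrequency", "MONTHLY")
                                    .append("vestingAge", 60))));
        }
    }
}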
On the negative side, I am a bit concerned about the integrity of data in the above solution. Since there is no referential integrity, are we 100% confident about the accuracy of the data? We would still need data profiling, data cleansing and data matching tools to identify unique customers and remove duplicates. 
MetLife is using this customer hub only for agents and has not exposed this data to customers, as there are concerns about data integrity and accuracy. But what if we need to enable customers to self-service all their policies from a single window on the organization's portal? We cannot show invalid data to the customer. 
Also, from a skills perspective, MongoDB needs specialized resources. It is easy to use and develop with, but for performance tuning and monitoring you need niche skills. 

Monday, February 11, 2013

Ruminating on Big Data

Came across an interesting infodeck on Big Data by Martin Fowler. There is a lot of hype around Big Data, and there are tens of pundits defining Big Data in their own terms :) IMHO, right now we are at the "peak of inflated expectations" and the "height of media infatuation" in the hype cycle.

But I agree with Martin that there is considerable fire behind the smoke. Once the hype dies down, folks will realize that we don't need another fancy term; we actually need to rethink the basic principles of data management.

There are 3 fundamental changes that are driving us to look beyond our current understanding of data management.
  1. Volume of data: Today the volume of data is so huge that the traditional data-management technique of creating a centralized database system is no longer feasible. Grid-based distributed databases are going to become more and more common.
  2. Speed at which data is growing: Due to Web 2.0, the explosion in electronic commerce, social media, etc., the rate at which data (mostly user-generated content) is growing is unprecedented in the history of mankind. According to Eric Schmidt of Google, every two days we now create as much information as we did from the dawn of civilization up until 2003. Walmart is clocking 1 million transactions per hour and Facebook has 40 billion photos! This image would give you an idea of the amount of Big Data generated during the 2012 Olympics. 
  3. Different types of data: We no longer have the liberty to assume that all valuable data will be available to us in a structured format, well defined using some schema. There is going to be a huge volume of unstructured data that needs to be exploited - for example, emails, application logs, web click-stream analysis, messaging events, etc. 
These 3 challenges are also popularly called the 3 Vs of Big Data (volume, velocity and variety of data). To tackle these challenges, Martin urges us to focus on the following 3 aspects:
  1. Extraction of data: Data is going to come from a lot of structured and unstructured sources, and we need new skills to harvest and collate it. The fundamental challenge is to judge how valuable a piece of data could be, and to discover such sources of data in the first place.
  2. Interpretation of data: The ability to separate the wheat from the chaff. What data is pure noise? How do we differentiate between signal and noise? How do we avoid probabilistic illusions?
  3. Visualization of data: Using modern visualization techniques to make the data more interactive and dynamic. Visualization can be kept simple, with good usability in mind. 
As this blog entry puts it in words - "Data is the new oil ! Data is just like crude. It’s valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc to create a valuable entity that drives profitable activity; so must data be broken down, analyzed for it to have value."

NoSQL databases are also gaining popularity. Application architects will need to consider polyglot persistence for datasets with different characteristics - for example, columnar (aggregate-oriented) data stores, graph databases, key-value stores, etc.