Wednesday, February 19, 2014

MongoDB support for geo-location data

It was interesting to learn that FourSquare uses MongoDB to store all its geo-spatial data. It was also enlightening to see the JSON standards for expressing GPS coordinates and other geo-spatial data. The following links give quick information about GeoJSON and TopoJSON.

http://en.wikipedia.org/wiki/GeoJSON
http://en.wikipedia.org/wiki/TopoJSON
http://bost.ocks.org/mike/topology/
https://github.com/mbostock/topojson/blob/master/README.md

MongoDB supports the GeoJSON format and allows us to easily build location-aware applications. For example, using MongoDB's geospatial query operators, you can -
  • Query for locations contained entirely within a specified polygon.
  • Query for locations that intersect with a specified geometry. We can supply a point, line, or polygon, and any document that intersects with the supplied geometry will be returned.
  • Query for the points nearest to another point.
As you can see, these queries make it very easy to find documents (JSON objects) that are near a given point or that lie within a given polygon.
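As a rough sketch (using the 2.x MongoDB Java driver), here is what a $near query could look like. The 'places' collection, its 'location' field, and the coordinates are hypothetical; the field must have a 2dsphere index for this query to work:

import java.util.Arrays;

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;

public class NearQueryExample {
    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("localhost", 27017);
        DBCollection places = client.getDB("test").getCollection("places");

        // GeoJSON point we want to search around (longitude, latitude)
        DBObject point = new BasicDBObject("type", "Point")
                .append("coordinates", Arrays.asList(-73.9667, 40.78));

        // Find documents whose 'location' lies within 500 metres of the point,
        // returned from nearest to farthest.
        DBObject near = new BasicDBObject("$geometry", point).append("$maxDistance", 500);
        DBObject query = new BasicDBObject("location", new BasicDBObject("$near", near));

        DBCursor cursor = places.find(query);
        while (cursor.hasNext()) {
            System.out.println(cursor.next());
        }
        client.close();
    }
}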

Ruminating on SSL handshake using Netty

I was again amazed at the simplicity of using Netty for HTTPS (SSL). The following samples are a good starting point for anyone interested in writing an HTTP server that supports SSL.

http://netty.io/5.0/xref/io/netty/example/http/snoop/package-summary.html

Adding support for 2-way SSL authentication (aka mutual authentication) is also very simple. Here are some hints on how to get this done.

http://stackoverflow.com/questions/9573894/set-up-netty-with-2-way-ssl-handsake-client-and-server-certificate
http://maxrohde.com/2013/09/07/setting-up-ssl-with-netty/
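As a quick illustration, here is a minimal sketch of a Netty 4.x channel initializer that enables SSL. It assumes an SSLContext has already been built from the server's keystore (and, for mutual authentication, a truststore holding the trusted client certificates); the class and variable names are my own:

import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.socket.SocketChannel;
import io.netty.handler.codec.http.HttpServerCodec;
import io.netty.handler.ssl.SslHandler;

public class SslServerInitializer extends ChannelInitializer<SocketChannel> {

    private final SSLContext sslContext;

    public SslServerInitializer(SSLContext sslContext) {
        this.sslContext = sslContext;
    }

    @Override
    protected void initChannel(SocketChannel ch) {
        SSLEngine engine = sslContext.createSSLEngine();
        engine.setUseClientMode(false);      // we are the server side of the handshake
        engine.setNeedClientAuth(true);      // require a client certificate => 2-way SSL

        // The SslHandler must come first so all traffic is decrypted before the HTTP codec sees it.
        ch.pipeline().addLast("ssl", new SslHandler(engine));
        ch.pipeline().addLast("codec", new HttpServerCodec());
    }
}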

Tuesday, February 18, 2014

Ruminating on IoT Security

The Internet of Things (IoT) is going to be the next big investment area for many organizations, as the value of real-time data keeps increasing in this hyper-competitive world.

Cisco's CEO states that IoT is going to be a $19 trillion opportunity. Coca-Cola has reserved a block of 16 million MAC addresses for its smart vending machines. In the Healthcare space, increasing survival rates have resulted in an aging population. With age comes chronic illness, and to address this, Payers and Providers are investing in home-care appliances to monitor patient data - heart/pulse rate, temperature, glucose level, etc.

The next couple of years will see a plethora of smart devices connected to the internet, and this presents a very challenging security problem. Poorly secured smart devices have already been hacked and compromised: fridges have been used to send spam messages, and smart TVs have been caught spying on viewers' home networks.

One of the fundamental non-technical challenges is that manufacturers of IoT devices have very little incentive to invest in patching old sensors/devices for security. The supply chain runs from the MCU (micro-controller) manufacturer to the OEM, and these folks are busy releasing new versions of their products and do not have the time and energy to patch their old products for security. A very good article describing this is available here.

On the technology front, the challenge is that sensor devices have limited resources for implementing industry-standard encryption techniques. A typical sensor (MCU - Micro Controller Unit) would have just a 32 MHz processor and 256 KB of memory. Also, most sensor systems are proprietary and closed, making it difficult to patch security updates in an open way.

There is no easy solution to the above problem. The only hope is that the newer MCUs developed for IoT devices have enough processing power to support digital certificates. Using digital certificates enables us to satisfy all facets of enterprise security - a) Authentication, b) Authorization, c) Integrity, d) Confidentiality, e) Non-repudiation.
Verizon has a comprehensive suite of products for enabling IoT security using PKI/digital certificate technologies. 

Monday, February 17, 2014

How are hash collisions handled in a HashMap/HashTable?

In my previous post, I had explained how a HashMap is the basic data structure used in many popular big data stores, including Hadoop HBase. But how is a HashMap actually implemented?

The following links give a very good explanation of the internal working of a HashMap.
http://java.dzone.com/articles/hashmap-internal
http://howtodoinjava.com/2012/10/09/how-hashmap-works-in-java/

Whenever a key/value pair is added to the HashMap data structure, the key is hashed. The hash of the key determines which 'bucket' (array index) the value is stored in.
So what happens if there is a hash collision? The answer is that each bucket is actually a linked list of entries. All entries whose keys hash to the same bucket are chained together, and a linear search is done within that linked list to find the matching key.
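Here is a stripped-down, hypothetical sketch of such a chained hash map (this is not the actual java.util.HashMap source, and resizing is omitted for brevity):

class SimpleHashMap<K, V> {

    private static class Entry<K, V> {
        final K key;
        V value;
        Entry<K, V> next;          // next entry chained in the same bucket
        Entry(K key, V value, Entry<K, V> next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    @SuppressWarnings("unchecked")
    private final Entry<K, V>[] buckets = new Entry[16];

    public void put(K key, V value) {
        int index = indexFor(key);
        for (Entry<K, V> e = buckets[index]; e != null; e = e.next) {
            if (e.key.equals(key)) {       // key already present - overwrite the value
                e.value = value;
                return;
            }
        }
        // hash collision (or empty bucket): prepend a new entry to the bucket's list
        buckets[index] = new Entry<>(key, value, buckets[index]);
    }

    public V get(K key) {
        for (Entry<K, V> e = buckets[indexFor(key)]; e != null; e = e.next) {
            if (e.key.equals(key)) {       // linear search within the bucket
                return e.value;
            }
        }
        return null;
    }

    private int indexFor(K key) {
        return (key.hashCode() & 0x7fffffff) % buckets.length;   // hash decides the bucket
    }
}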

HashSet vs HashMap

Today, one of my developers was using a HashMap to store a white-list of IP addresses. He was storing the IP address as both the key and value.
I told him about the HashSet class, which can also be used to store a white-list of elements; like the HashMap's get(), a HashSet's contains() operation also has O(1) time complexity.

We were a bit intrigued on how a HashSet actually works; so we looked into the source code. Much to our surprise, a HashSet actually uses a HashMap internally. For every item we add to the set, it adds it as a key to the HashMap and adds a dummy object as a value, as given below.

// Dummy value to associate with an Object in the backing Map
private static final Object PRESENT = new Object();
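The add() method then simply delegates to the backing map; simplified from the JDK source, it looks roughly like this:

public boolean add(E e) {
    return map.put(e, PRESENT) == null;   // put() returns null if the key was not already present
}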

In theory, a Set and a Map are separate concepts. A set cannot contain duplicates and is unordered, whereas a Map is a collection of key/value pairs.

Besides HashSet, there are two more Set implementations that are useful: TreeSet, which keeps the elements sorted as you add them to the collection, and LinkedHashSet, which maintains the insertion order.
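A quick (hypothetical) snippet illustrating the ordering differences between the three implementations:

import java.util.*;

public class SetOrderingDemo {
    public static void main(String[] args) {
        List<String> input = Arrays.asList("banana", "apple", "cherry");

        Set<String> hashSet = new HashSet<>(input);          // no ordering guarantee
        Set<String> treeSet = new TreeSet<>(input);          // sorted: [apple, banana, cherry]
        Set<String> linkedSet = new LinkedHashSet<>(input);  // insertion order: [banana, apple, cherry]

        System.out.println(hashSet);
        System.out.println(treeSet);
        System.out.println(linkedSet);
    }
}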

Wednesday, February 12, 2014

Ruminating on Star Schema

Sharing a good YouTube video on the basic concepts of a Star Schema. Definitely worth a perusal for anyone wanting a primer on DW schemas.



The data in the dimension tables can also change, albeit less frequently. For example, a state changes its name, or a customer changes his last name. These are known as 'slowly changing dimensions'. To support slowly changing dimensions, we would need to add timestamp (validity) columns to the dimension table; a small hypothetical example is sketched further below. A good video explaining these concepts is available below:


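For illustration, a Type 2 slowly changing dimension keeps the old row and adds a new one with validity dates (a hypothetical customer dimension; the column names are made up):

CustomerKey | CustomerID | LastName | ValidFrom  | ValidTo
101         | C001       | Smith    | 2010-01-01 | 2013-06-30
102         | C001       | Jones    | 2013-07-01 | (current)

Since the fact table joins on the surrogate key (CustomerKey), historical facts continue to point to the 'Smith' version of the row, while new facts reference 'Jones'.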
Ruminating on Column Oriented Data Stores

There is some confusion in the market regarding column-oriented databases (aka column-based). Due to the popularity of NoSQL stores such as HBase and Cassandra, many folks assume that column-oriented stores are only for unstructured big data. The fact is that column-oriented tables are available in both SQL (RDBMS) and NoSQL stores.

Let's look at SQL RDBMS first to understand the difference between row-based and column-based tables.
First and foremost, it is important to understand that whether the RDBMS is column- or row-oriented is a physical storage implementation detail of the database. There is NO difference in the way we query these databases using SQL or MDX (for multi-dimensional querying on cubes).

In a row-based table, all the attributes (fields) of a row are stored side by side in a linear fashion. In a column-based table, all the data in a particular column is stored side by side. An excellent overview of this concept with illustrative examples is given on Wikipedia. The small example below should make the concept clear.
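Consider a hypothetical table with three rows: (1, Smith, 30), (2, Jones, 45), (3, Brown, 28).

Row-oriented storage lays the data out on disk row by row:
1,Smith,30 ; 2,Jones,45 ; 3,Brown,28

Column-oriented storage lays the same data out column by column:
1,2,3 ; Smith,Jones,Brown ; 30,45,28

An aggregate such as the average age only needs to read the third block in the column-oriented layout, whereas the row-oriented layout forces a scan through every full row.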


For any query, the most expensive operations are hard disk seeks. As you can see from the layouts above, certain queries will run faster depending on the type of table storage. Column-oriented tables are more efficient when aggregates (or other calculations) need to be computed over many rows, but only over a notably smaller subset of all the columns. Hence column-oriented storage is popular in OLAP databases, whereas row-oriented storage is popular in OLTP databases.
What's interesting to note is that with the advent of in-memory databases, disk seek time is no longer a constraint. But large DWs contain petabytes of data, and all of that data cannot be kept in memory.

A few years back, the world was divided between column-based databases and row-based databases. But gradually, almost all database vendors are providing the flexibility to store data in either row-based or column-based tables in the same database, e.g. SAP HANA and Oracle 12c. All these databases are also adding in-memory capabilities, thus boosting their performance many-fold.

Now, in the case of NoSQL stores, the concept of column-oriented storage is more or less the same. But the difference is that stores such as HBase allow us to have a different number of columns for each row. More information on the internal schema of HBase can be found here.

Ruminating on BSON vs JSON

MongoDB uses BSON as the default format to store documents in collections. BSON is a binary-encoded serialization of JSON-like documents. But what are the advantages of BSON over the ubiquitous JSON format?

In terms of space (memory footprint), BSON is generally more efficient than JSON, but it need not always be so. For example, integers are stored as 32-bit (or 64-bit) binary integers, so they don't need to be parsed to and from text; for small integers, however, this takes more space than JSON.

The primary goal of BSON is to enable very fast traversability, i.e. faster processing. BSON adds extra information to documents, e.g. length prefixes, that makes them easy and fast to traverse. Due to this, a BSON parser can scan through documents at blinding speed.
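As an illustration (byte values shown in hex), the document {"a": 1} is encoded roughly as follows:

0C 00 00 00    total document length (12 bytes, little-endian int32)
10             element type: 32-bit integer
61 00          field name "a" as a null-terminated cstring
01 00 00 00    the value 1 (little-endian int32)
00             document terminator

The length prefix at the start is what lets a parser skip over an entire sub-document without inspecting its contents.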

Thus BSON sacrifices space efficiency for processing efficiency (a.k.a traversability). This is a fair trade-off for MongoDB where speed is the primary concern. 

Ruminating on HBase datastore

While explaining the concept of HBase to my colleagues, I have observed that folks that do NOT have the baggage of traditional knowledge on RDBMS/DW are able to understand the fundamental concepts of HBase much faster than others. Application developers have been using data structures (collections) such as HashTable, HashMap for decades and they are better able to understand HBase concepts.

Part of the confusion is due to the terminology used in describing HBase. It is categorized as a column-oriented NoSQL store and is different from a row-oriented traditional RDBMS. There are tons of articles that describe HBase as a data structure containing rows, column families, columns, cells, etc. In reality, HBase is nothing but a giant multidimensional sorted HashMap that is distributed across nodes. A HashMap consists of a set of key-value pairs, and a multidimensional HashMap is one whose 'values' are themselves other HashMaps.

Each row in HBase is actually a key/value map. This map can have any number of keys (known as columns), each of which has a value. This 'value' can itself be a HashMap that holds the version history of the cell, keyed by timestamp.

Each row in HBase can also store multiple key/value maps. In such cases, each key/value map is called a 'column family'. Each column family is typically stored in separate physical files or on separate disks. This concept was introduced to support use-cases where you have two sets of data for the same entity that are not generally accessed together.
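Putting this together, an HBase table can be pictured (conceptually only - this is a mental model, not how HBase is actually implemented) as nested sorted maps. A rough Java sketch, using String keys for readability:

import java.util.SortedMap;
import java.util.TreeMap;

public class HBaseMentalModel {
    // row key -> column family -> column qualifier -> timestamp -> cell value
    SortedMap<String, SortedMap<String, SortedMap<String, SortedMap<Long, String>>>> table = new TreeMap<>();
}

A read such as table.get(rowKey).get(family).get(qualifier) would return the map of timestamps to values for that cell, and calling lastKey() on it would give the timestamp of the latest version.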

The following article gives a good overview of the concepts we just discussed with JSON examples.
http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable

Once we have a solid understanding of HBase concepts, it is useful to look at some common schema examples or use-cases where HBase datastore would be valuable. Link: http://hbase.apache.org/book/schema.casestudies.html

Looking at the use-cases in the above link, it is easy to understand that a lot of thought needs to go into designing a HBase schema. There are multiple ways to design a HBase schema, and the right one depends on the primary use-case, i.e. how the data is going to be accessed. This is in total contrast to traditional data warehouses, where the applications accessing the warehouse are free to define their own use-cases and the warehouse supports almost all of them within reasonable limits.

The following articles throw more light on the design constraints that should be considered while using HBase.
http://www.chrispwood.net/2013/09/hbase-schema-design-fundamentals.html
http://ianvarley.com/coding/HBaseSchema_HBaseCon2012.pdf

For folks who do not have a programming background but want to understand HBase in terms of table concepts, the following YouTube video would be useful.
http://www.youtube.com/watch?v=IumVWII3fRQ

Tuesday, February 11, 2014

Examples of Real Time Analytics in Healthcare

The Healthcare industry has always been striving to reduce the cost of care and improve the quality of care. To do this, Payers are focusing more on the prevention and wellness of members.

Digitization of clinical information and real-time decision making are important IT capabilities that are required today. Given below are some examples of real-time analytics in Healthcare.

  1. Hospital-acquired infections are very dangerous for premature infants. Monitors can detect patterns in infected premature babies up to 24 hours before any symptoms appear. This real-time data can be captured and run through a real-time analytics engine to identify such cases and ensure that adequate treatment is given ASAP.
  2. Use real-time event processing to act as an early warning system based on historical patterns or trends. For example, if some members exhibit the behavior patterns of prior patients who relapsed into a critical condition, we can plan a targeted intervention for those members.

Ruminating on Decision Trees

Decision trees are tree-like structures that can be used for decision making, classification of data, etc.
The following simple example (on the IBM SPSS Modeler Infocenter Site) shows a decision tree for making a car purchase.

Another example of a decision tree that can be used for classification is shown below. These diagrams are taken from the article available at www.cse.msu.edu/~cse802/DecisionTrees.pdf



Any tree with a branching factor of 2 (i.e. only 2 branches at each node) is called a "binary decision tree". Any tree with a variety of branching factors can be represented as an equivalent binary tree. For example, the binary tree below will evaluate to the same result as the first tree.


It is easy to see that such decision tree models can help us in segmentation - for example, segmenting patients into high-risk and low-risk categories, or classifying credit as high-risk vs. low-risk.
An excellent whitepaper on Decision Trees by SAS is available here.
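To make the idea concrete, here is a hypothetical, hand-coded decision tree in Java that segments patients by risk; the attributes, thresholds, and outcomes are made up purely for illustration (real trees would be induced from data by algorithms such as CART):

public class RiskSegmentation {

    public static String classify(int age, boolean smoker, double bmi) {
        if (age > 60) {                                  // first split on age
            if (smoker) {
                return "HIGH_RISK";
            }
            return bmi > 30 ? "HIGH_RISK" : "LOW_RISK";  // second split on BMI
        }
        return (smoker && bmi > 35) ? "HIGH_RISK" : "LOW_RISK";
    }

    public static void main(String[] args) {
        System.out.println(classify(65, true, 27.0));   // HIGH_RISK
        System.out.println(classify(40, false, 24.0));  // LOW_RISK
    }
}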

Decision trees can also be used in predictive modeling - this is known as Decision Tree Learning or Decision Tree Induction. Other names for such tree models are classification trees or regression trees, aka Classification And Regression Trees (CART).
Essentially, "Decision Tree Learning" is a data mining technique in which a decision tree is constructed by slicing and dicing the data using statistical algorithms. Decision trees are produced by algorithms that identify various ways of splitting a data set into branch-like segments.
For example, Wikipedia has a good example of a decision tree that was constructed by looking at the historical data of Titanic survivors.
Decision Tree constructed through Data Mining of Titanic passengers.
Once such a decision tree model has been created, it can be exported as a standard PMML file. This PMML file can then be used in a real-time scoring engine such as JPMML.

There is another open source project called 'OpenScoring' that uses JPMML behind the scenes and provides us with a REST API to score data against our model. A simple example (with probability prediction mode) for identifying a flower based on its attributes is illustrated here: https://github.com/jpmml/openscoring

Decision Trees can also be modeled in Rule Engines. IBM iLog BRMS suite (WODM) supports the modeling of rules as a Decision Tree. 

When to use a Rule Engine?

The following article on JessRules.com is a good read before we jump into using a rule engine for each and every problem.


Earlier, I had also written another blog post that lists the simple steps one should take to understand what kind of data (logic) should be put in a rule engine.

Friday, February 07, 2014

Ruminating on Netty Performance

Recently we again used Netty for building an Event Server from the ground up, and we were amazed at the performance of this library. The secret sauce of Netty is of course its implementation of the Reactor Pattern using Java NIO. More information about the Reactive design paradigm can be found @ http://www.reactivemanifesto.org/

The following links bear testimony to the extreme performance of Netty.

http://netty.io/testimonials

http://www.infoq.com/news/2013/11/netty4-twitter (5x performance improvement)

http://yahooeng.tumblr.com/post/64758709722/making-storm-fly-with-netty (Netty used in Yahoo Storm)

Wednesday, February 05, 2014

Ruminating on Apple Passbook

The Apple 'Passbook' is an iOS application that allows users to store coupons, boarding passes, event tickets, gift cards (or any other card for that matter) on their phone. Hence it functions as a digital wallet.
Apple defines passes as - "Passes are a digital representation of information that might otherwise be printed on small pieces of paper or plastic. They let users take an action in the physical world, in the same way as boarding passes, membership cards, and coupons."

The Passbook application intelligently pops up the pass (a 2D barcode) on the lock screen at the right time/place - i.e. triggered either by place (using GPS tracking) or by time.

It's common sense to realize that the success of this app would largely depend on the number of partners who agree to publish their passes/cards in a format that is compatible with Apple Passbook. Many gift card companies have started adopting Passbook and are sending their coupons as Passbook attachments.

So what does a Passbook pass in digital format contain? The pass file is essentially a zip file with the *.pkpass extension. The zip file contains meta-data as JSON files and images as PNG files.
This format is an open format and hence any merchant or organization can adopt this standard and allow their customers to use the digital format of their coupon/pass/gift certificate/gift card, etc.