Thursday, September 11, 2014

Monitoring TomEE using VisualVM

A few years back, we moved from Tomcat to JBoss for our production servers, because at the time there was no viable enterprise support for Tomcat.

Today, we have viable options such as support from Tomitribe.

The below article from Tomitribe gives a good overview of setting up VisualVM for monitoring TomEE.

http://www.tomitribe.com/blog/2014/07/monitoring-an-apache-tomee-service-on-windows-with-visualvm/

Default tools in the JDK

Found the below article worth a perusal. We get so used to using sophisticated tools that we forget there are things we can do with a bare JDK :)

http://zeroturnaround.com/rebellabs/the-6-built-in-jdk-tools-the-average-developer-should-learn-to-use-more/

Monday, September 01, 2014

Does Digital Transformation need 'Skunk Works' kind of environment?

Skunk Works is a term that originated during WWII and is the official alias for Lockheed Martin’s Advanced Development Programs (ADP).

Today, Skunk Works is used to describe a small group of people who work on advanced technology projects. Such a group is typically given a high degree of autonomy and is unhampered by bureaucracy. The primary driver for setting up a Skunk Works team is to develop something quickly with minimal management constraints.
The term also refers to technology projects developed in semi-secrecy, such as the Google X Lab or the 50-person team established by Steve Jobs to develop the Macintosh computer.

For any organization embarking on a Digital Transformation journey, it would be worthwhile to build such a Skunk Works team that can innovate quickly and bring an idea to a required threshold of technology readiness. I have seen so many ideas die under the shackles of bureaucracy and long processes. Having a Skunk Works team operate like a start-up within your organization can do wonders in leap-frogging your competition in the digital age.

Monday, August 25, 2014

Ruminating on Showrooming and Webrooming in the Digital Age

When e-Commerce giants such as Amazon took the retail industry by storm, there was a lot of FUD on showrooming. As a digital native, even I indulged in showrooming before heading out to my favourite e-commerce site to buy the product online.

But a recent study conducted in the US has found that many folks also engage in reverse showrooming (aka webrooming). In "reverse showrooming," or "webrooming," consumers go online to research products, but then actually go to a bricks-and-mortar store to complete their purchase.

The following link on Business Insider sheds more light on this phenomenon.
http://www.businessinsider.in/Reverse-Showrooming-Bricks-And-Mortar-Retailers-Fight-Back/articleshow/30411064.cms

This report came as a surprise to me and I would assume that retailers are happy about this trend :)
Retailers are also trying out innovative techniques to capitalize on this trend. Some of them include deploying knowledgeable sales staff who educate the customer and create a superior in-store customer experience. BLE-enabled beacons push personalized offers to the customer's mobile app while he is in the store, and m-Wallets enable contactless, hassle-free payments at the POS.

Retailers are also embracing BOPiS (Buy Online, Pick Up In Store)! This greatly reduces logistics/shipping costs, as the existing transportation network is used for delivery.

Popular e-Commerce software vendors such as Hybris have also started catering to this market and have an in-store solution for retailers. http://www.hybris.com/en/products/instore

Friday, August 22, 2014

A good comparison of BLE and Classic Bluetooth

The following link gives a good overview of the differences between BLE (Bluetooth Low Energy) and classic Bluetooth. Definitely worth a perusal.

The fundamental reason why BLE is becoming so popular in beacons is the extremely low power consumption of BLE devices. Its low power consumption makes it possible to power a small device with a tiny coin-cell battery for 5–10 years!

http://www.medicalelectronicsdesign.com/article/bluetooth-low-energy-vs-classic-bluetooth-choose-best-wireless-technology-your-application

Tuesday, August 12, 2014

How does Facebook protect its users from malicious URLs?

The following post gives a good overview of the various techniques (such as link shim) used by Facebook to protect its users from malicious websites - whose links would be embedded in posts.

https://www.facebook.com/notes/facebook-security/link-shim-protecting-the-people-who-use-facebook-from-malicious-urls/10150492832835766

Facebook has its internal blacklist of malicious links and also queries external partners such as McAfee, Google, Web of Trust, and Websense.  When FB detects that a URL is malicious, it displays an interstitial page before the browser actually requests the suspicious page. This protects the user, who now has to make a conscious decision as to whether he wants to proceed to the malicious page.

BTW, if you have not already installed the 'Web of Trust' browser plugin for your browser, do so immediately :)

Another interesting point was the fact that it is more secure to run a check at click time than at display time. If we relied on display-time filtering alone, we would not be able to retroactively block malicious URLs lying in old emails or old pages.

Wednesday, July 09, 2014

Collection of free books from Microsoft

Eric Ligman has provided links to a large collection of free Microsoft books on a variety of topics on his blog post (link below).

http://blogs.msdn.com/b/mssmallbiz/archive/2014/07/07/largest-collection-of-free-microsoft-ebooks-ever-including-windows-8-1-windows-8-windows-7-office-2013-office-365-office-2010-sharepoint-2013-dynamics-crm-powershell-exchange-server-lync-2013-system-center-azure-cloud-sql.aspx

Some of the books that I found interesting were on Azure Cloud Design Patterns, SharePoint, Office 365, etc.

Tuesday, June 03, 2014

Categorization of applications in IT portfolio

During any portfolio rationalization exercise, we categorize applications based on various facets, as explained in one of my old posts here.

Interestingly, Gartner has defined three application categories, or "layers," to distinguish application types and help organizations develop more appropriate strategies for each of them.

Snippets from the Gartner news site (http://www.gartner.com/newsroom/id/1923014)

Systems of Record — Established packaged applications or legacy homegrown systems that support core transaction processing and manage the organization's critical master data. The rate of change is low, because the processes are well-established and common to most organizations, and often are subject to regulatory requirements.
Systems of Differentiation — Applications that enable unique company processes or industry-specific capabilities. They have a medium life cycle (one to three years), but need to be reconfigured frequently to accommodate changing business practices or customer requirements.
Systems of Innovation — New applications that are built on an ad hoc basis to address new business requirements or opportunities. These are typically short life cycle projects (zero to 12 months) using departmental or outside resources and consumer-grade technologies.


Ruminating on RTB, GTB and TTB

The IT industry loves TLAs (three-letter acronyms)! Recently a customer was explaining their IT budget distribution to us in terms of 'Run the business investments', 'Grow the business investments' and 'Transform the business investments'.

RTB investments are for 'keeping the lights on'. This budget is required to keep the operations that support the core business functions running. In RTB investments, the core focus is on efficiency and performance optimization. RTB-type applications are increasingly being outsourced to an IT vendor under a managed services contract.

GTB investments are used to support organic growth and increased customer demand - for example, adding capacity to an existing data center, bolstering your DR site, virtualization for quick provisioning, etc.

TTB investments are for creating new products or introducing new services, i.e. making changes to the current business model. For example, Apple entered the music industry with iTunes, and IBM moved from hardware to services.

Monday, May 26, 2014

Ruminating on HIPAA compliance

I was a bit confused on the intricacies of what entities are covered under HIPAA. The following article helped me clear a few cobwebs and also helped me appreciate the fact that it's impossible to protect all healthcare information all the time.

http://www.worldprivacyforum.org/2013/09/hipaaguide9-2/

The crux of the HIPAA regulation is that your information is only protected by a 'covered entity'. HIPAA defines 3 types of covered entities - Payer, Provider and Clearing House.

Posting interesting snippets from the site:

Health information is protected only when held by a covered entity. It may have no privacy protections when held by someone who is not a covered entity. In other words, health privacy protections depend on who has the information and not on the nature of the information.

It is important to understand that HIPAA does not automatically cover all health care providers. A free health clinic may not be subject to HIPAA because it doesn’t bill anyone. A doctor who charges every patient $25 cash and does not submit a bill to any insurance company may not be covered by HIPAA. A first aid room at your workplace may or may not be covered by HIPAA.

Most school health records are not subject to HIPAA. Instead, school records (private schools are a major exception) are usually covered by another federal privacy law, the Family Educational Rights and Privacy Act (FERPA). 

The list of unregulated health record keepers is shockingly long. It includes gyms, medical and fitness apps and devices not offered by covered entities, health websites not offered by covered entities, Internet search engines, life and casualty insurers, the Medical Information Bureau, employers (but this one is complicated), worker's compensation insurers, banks, credit bureaus, credit card companies, many health researchers, the National Institutes of Health, cosmetic medicine services, transit companies, hunting and fishing license agencies, occupational health clinics, fitness clubs, home testing laboratories, massage therapists, nutritional counselors, alternative medicine practitioners, disease advocacy groups, marketers of non-prescription health products and foods, and some urgent care facilities.

Friday, May 23, 2014

Ruminating on Rate Limiting

As architects, when we define the API strategy for any organization, we also need to design the 'Rate Limiting' features for that API. The concept of rate limiting is not new - the term has long been used in the networking world to denote controlling the rate of traffic sent over a network.

Other common examples of Rate Limiting that we see very often are as follows:
  1. Limit consecutive wrong password entries to 3.
  2. Maximum size of an email attachment.
  3. Max number of emails one can send in a day.
  4. Max number of search queries one can fire every minute.
  5. Max. broadband download size per day, etc. 
Rate Limiting is also an important line of defense from a security perspective. Jeff Atwood has a good blog post on 'Rate Limiting' available at: http://blog.codinghorror.com/rate-limiting-and-velocity-checking/

For services or APIs, there are standard ways in which we can rate limit requests. For example:
  • Based on API key: This is how Twitter rate limits its API. Each account with an API key can only make x requests per time period - for example, 10 requests every 5 mins, 500 requests per day, etc.
  • Based on IP address: This may not work behind a proxy due to NAT.
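The API-key scheme above can be sketched in a few lines of Java. This is just an illustrative fixed-window counter - the class and method names are my own invention, and a production API gateway would typically use a token bucket or a sliding window backed by a shared store:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative fixed-window rate limiter keyed by API key.
// All names here are my own invention for this sketch.
public class ApiRateLimiter {
    private final int maxRequests;     // allowed requests per window
    private final long windowMillis;   // window size in milliseconds
    private final Map<String, Window> windows = new HashMap<>();

    public ApiRateLimiter(int maxRequests, long windowMillis) {
        this.maxRequests = maxRequests;
        this.windowMillis = windowMillis;
    }

    /** Returns true if the caller identified by apiKey is within its quota. */
    public synchronized boolean allow(String apiKey, long nowMillis) {
        Window w = windows.get(apiKey);
        if (w == null || nowMillis - w.start >= windowMillis) {
            w = new Window(nowMillis);   // start a fresh window
            windows.put(apiKey, w);
        }
        w.count++;
        return w.count <= maxRequests;
    }

    private static final class Window {
        final long start;
        int count;
        Window(long start) { this.start = start; }
    }
}
```

For example, `new ApiRateLimiter(10, 5 * 60 * 1000)` models the "10 requests every 5 mins" policy mentioned above.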

Tuesday, May 20, 2014

Appending the current date to the file-name in a DOS batch program

I was writing a utility batch file for my backup folders and wanted the current date appended to the filename. Using just %DATE% was throwing errors, as the default output of the date command contains a space on Windows. For example, echo %DATE% returns "Tue 05/20/2014".

The following format of the %DATE% variable did the job for me. A good trick to have up your sleeve :)
echo %date:~-10,2%-%date:~-7,2%-%date:~-4,4%

The :~-N,M syntax extracts M characters, starting N characters from the end of the string. Just copy-paste fragments of the above string to see how this works - for example, echo %date:~-10% prints the last 10 characters.

I had used this to create a date-stamped jar file as follows.
jar -cvf backup_%date:~-10,2%-%date:~-7,2%-%date:~-4,4%.jar data/*
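If the backup utility lived in Java instead of a batch file, the same MM-dd-yyyy stamp could be produced with SimpleDateFormat. A small sketch - the class name and the backup_ prefix simply mirror the batch example above:

```java
import java.text.SimpleDateFormat;
import java.util.Date;

public class DateStampedName {
    // Builds a name like backup_05-20-2014.jar, matching the MM-dd-yyyy
    // stamp produced by the batch snippet above.
    public static String jarName(Date date) {
        return "backup_" + new SimpleDateFormat("MM-dd-yyyy").format(date) + ".jar";
    }

    public static void main(String[] args) {
        System.out.println(jarName(new Date()));
    }
}
```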

Wednesday, May 14, 2014

Ruminating on Insurance Agents, Brokers, Producers

In the insurance industry, the terms 'Agent', 'Broker' and 'Producer' are often used interchangeably. But in different markets they have different meanings and are governed by different regulations. Jotting down the information I have gathered from discussions and Q&A sessions with my friends in the insurance industry.

  • Agents have a primary alliance with the insurance carrier, whereas Brokers have a primary alliance with the insurance buyer. In the Healthcare industry, though, both terms are used interchangeably and agents/brokers are also called 'Producers'.
  • Agents can be 'captive' or 'independent'. A captive agent represents only a single insurer. He is typically on the salary rolls of the carrier and earns a commission on every policy he sells. An independent agent can represent multiple insurance carriers. Independent insurance agents are not on the insurance carriers' salary rolls and earn only commissions. Several insurance carriers may authorize an agent to sell for them.
  • Independent insurance agents may also work with insurance intermediaries that aggregate quotes from multiple insurance carriers and allow the agent to compare and select the best fit for the customer. Independent agents also provide packaged policies - for example, combining auto and home insurance into a single policy. The customer benefits with lower premiums.
  • Both captive and independent agents have a contract with the insurance carrier that details out the binding authority of the agent - essentially the authority to bind a policy on the insurer's behalf.
  • Brokers typically do not have the authority to bind policies. Since brokers cannot bind policies, they have to obtain a binder from the insurance carrier. A binder is a legal document that serves as a temporary insurance policy for around 30 days, and must be signed by a representative of the insurer. A binder is replaced by a policy, once the policy is generated.
  • Brokers may or may not earn commissions from the insurance carrier. They get a flat fee from the insurance buyer for their services. 
  • Brokers can be retail or wholesale. Retail brokers directly engage with the end customers. Sometimes, for very specialized insurance needs, retail brokers may contact a wholesale broker - for example, a wholesale broker may specialize in auto-manufacturing liability insurance.
  • Commissions are of two types - a flat (base) commission that is paid for every policy sold, and an incentive commission that is paid if a particular volume or other growth targets are met. There is a lot of debate on the incentive commissions received by independent agents and brokers, because these bonuses may affect the neutrality of a broker who is supposed to represent the insured. In many countries there are regulations requiring brokers to disclose to customers the commissions they would earn.

Friday, May 02, 2014

Ruminating on the #hashtag economy

The '#' (hash) symbol was originally used by Twitter users to categorize their messages and to identify keywords and trending topics. After Twitter, users on other social channels such as Instagram, Tumblr and Pinterest jumped on the bandwagon and started using #hashtags to participate in online conversations on a topic. A good article explaining the origin of the hashtag is available here.

Ironically, Facebook adopted #hashtags quite late in the game. But today Facebook has full support for hashtags, and when we click on a hashtag, we see a feed of all posts (what people are saying) about that event or topic. Hashtags have become so popular that Obama recently encouraged citizens to use the hashtag "#1010Means" to protest low minimum wages.

Needless to say, hashtags are a powerful concept for advertisers to capitalize on. They help advertisers market their products/services to the right audience - people who are interested in a particular subject.

Organizations have also started using the power of hashtags on social channels to build innovative services that bring in new sources of revenue or help improve customer satisfaction rates. This whole new paradigm is known as the "#hashtag economy".

For example, Amex has creatively used hashtags to send promotions to its customers. https://sync.americanexpress.com/twitter/Index

Kotak Bank has created a customer self-service platform on Twitter, wherein customers can tweet the right hashtags to perform banking transactions such as checking their balance, requesting a chequebook, etc. http://economictimes.indiatimes.com/industry/banking/finance/banking/kotak-mahindra-bank-links-current-accounts-to-twitter/articleshow/32733759.cms

Wednesday, April 23, 2014

Ruminating on Network Port Mirroring

For any network sniffer (analyzer) or Network Intrusion Detection System to work, the concept applied behind the scenes is 'Network Port Mirroring'.

Port mirroring is needed for traffic analysis on a switch because a switch normally sends packets only to the port to which the destination device is connected. Hence most switches support configuring 'port mirroring' to send a copy of each network packet to another port (a local port or a separate VLAN port).

The following links are worth a perusal.
http://searchnetworking.techtarget.com/definition/port-mirroring

https://www.juniper.net/techpubs/en_US/junos12.2/topics/concept/port-mirroring-qfx-series-understanding.html

Monday, April 14, 2014

Updating content in an iOS app

Any mobile app needs to have a design strategy for updating content from the server. We were exploring multiple options for retrieving content from the server and updating the local cache in our iOS app.
After considerable research, we have found that iOS 7 provides a very neat design for background fetching of new content - one that uses silent push events to raise events in the client app. Even if the app is not running, it would be launched in the background (with UI invisible, rendered off-screen) to process the event. The following article gives a very good overview of the technique.

Some snippets from the above article:

A Remote Notification is really just a normal Push Notification with the content-available flag set. You might send a push with an alert message informing the user that something has happened, while you update the UI in the background. 
But Remote Notifications can also be silent, containing no alert message or sound, used only to update your app’s interface or trigger background work. You might then post a local notification when you’ve finished downloading or processing the new content. 

 iOS 7 adds a new application delegate method, which is called when a push notification with the content-available key is received. Again, the app is launched into the background and given 30 seconds to fetch new content and update its UI, before calling the completion handler.

How is the App launched in the background? 
If your app is currently suspended, the system will wake it before calling application:performFetchWithCompletionHandler:. If your app is not running, the system will launch it, calling the usual delegate methods, including application:didFinishLaunchingWithOptions:. You can think of it as the app running exactly the same way as if the user had launched it from Springboard, except the UI is invisible, rendered offscreen.

Thursday, March 20, 2014

Mobile Device Management Products

My team was helping our organization evaluate different Mobile Device Management (MDM) tools for enterprise-level deployment. The following 2 articles are an excellent read for understanding the various features provided by MDM tools and how the products compare to each other.

http://www.computerworld.com/s/article/9245614/How_to_choose_the_right_enterprise_mobility_management_tool

http://www.computerworld.com/s/article/9238981/MDM_tools_Features_and_functions_compare

Monday, March 17, 2014

Ruminating on Distributed Logging

Of late, we have been experimenting with different frameworks available for distributed logging. I recollect that a decade back, I had written my own rudimentary distributed logging solution :)

To better appreciate the benefits of distributed log collection, it's important to visualize logs as streams and not files, as explained in this article.
The most promising frameworks we have experimented with are:

  1. Logstash: Logstash combined with ElasticSearch and Kibana gives us a cool OOTB solution. Also, Logstash is developed on the Java platform and was very easy to set up and start running. 
  2. Fluentd: Another cool framework for distributed logging. A good comparison between Logstash and Fluentd is available here
  3. Splunk: The most popular commercial tool for log management and data analytics. 
  4. GrayLog: A new kid on the block. Uses ElasticSearch. Need to keep a watch on this. 
  5. Flume: Flume's main goal is to deliver data from applications to Apache Hadoop's HDFS. It has a simple and flexible architecture based on streaming data flows.
  6. Scribe: Scribe is written in C++ and uses Thrift for the protocol encoding. This project was released as open-source by Facebook.  

Sunday, March 16, 2014

Calling a secure HTTPS webservice from a Java client

Over the last decade, I have seen so many developers struggle with digital certificates when they have to call a secure webservice. A lot of confusion arises when a secure https webservice call is made from a servlet running in Tomcat. This is because the exception stack shows an SSLHandshakeException, and then developers keep fiddling with the Tomcat connector configuration as stated here.

But when we make a connection to a secure server, what we need is to trust the digital certificate of the server. If the digital certificate of the server has been signed by a trusted root authority such as 'Verisign' or 'eTrust', then our default Java trust store will automatically validate it. But if the server has a self-signed certificate, then we have to add the server's digital certificate to the trust store.

There are multiple ways of doing this. A long time ago, I had blogged about one option that entails setting the Java system properties. This can be done through code or by setting the Java properties of the JVM during startup. For e.g.


// Point the JVM at the custom trust store before making the HTTPS call
System.setProperty("javax.net.ssl.trustStore", trustFilename);
System.setProperty("javax.net.ssl.trustStorePassword", "changeit");



Different AppServers (WebSphere, Weblogic, etc.) may provide different ways to add certs to the trust store.

Another option is to create a cert-store (filename:jssecacerts) that contains the digital cert of the server and copy that cert-store file to the “$JAVA_HOME\jre\lib\security” folder. There is also a nifty program called InstallCert.java that downloads the certificate and creates the cert-store file. A good tutorial on the same is available here.
I have also created a mirror of InstallCert.java here. This program can be run without any dependencies on external libraries and I have found it to be very handy.

So what is the difference between setting the TrustStore system property and adding the jssecacerts file?
Well, the documentation of JSSE should help our understanding here. The TrustManager performs the following steps to search for trusted certs:
1. The javax.net.ssl.trustStore system property
2. $JAVA_HOME/lib/security/jssecacerts
3. $JAVA_HOME/lib/security/cacerts (shipped by default)

It's important to note that if the TrustManager finds the jssecacerts file, it will not read the cacerts file! Hence it may be a better option to add the server's digital cert to the cacerts keystore file. To add a certificate to a keystore, there is a nice GUI program called portecle. Alternatively, do it from the command prompt using the keytool command as stated here.
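To see which trust store the JVM actually ends up using, you can ask JSSE for its default trust managers. The sketch below (the class name is my own) initializes a TrustManagerFactory with a null KeyStore - which triggers exactly the search order listed above - and then counts the trusted root certs:

```java
import javax.net.ssl.TrustManager;
import javax.net.ssl.TrustManagerFactory;
import javax.net.ssl.X509TrustManager;
import java.security.KeyStore;
import java.security.cert.X509Certificate;

public class DefaultTrustStoreDump {
    // Passing a null KeyStore makes JSSE fall back to the default search
    // order: the javax.net.ssl.trustStore property, then jssecacerts,
    // then cacerts.
    public static X509Certificate[] acceptedIssuers() {
        try {
            TrustManagerFactory tmf = TrustManagerFactory.getInstance(
                    TrustManagerFactory.getDefaultAlgorithm());
            tmf.init((KeyStore) null);
            for (TrustManager tm : tmf.getTrustManagers()) {
                if (tm instanceof X509TrustManager) {
                    return ((X509TrustManager) tm).getAcceptedIssuers();
                }
            }
            return new X509Certificate[0];
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Trusted root CAs: " + acceptedIssuers().length);
    }
}
```

Running this before and after importing your self-signed cert into cacerts is a quick way to verify that the import actually worked.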

Wednesday, February 19, 2014

MongoDB support for geo-location data

It was interesting to learn that Foursquare uses MongoDB to store all its geo-spatial data. It was also enlightening to see JSON standards for expressing GPS coordinates and other geo-spatial data. The following links give quick information about GeoJSON and TopoJSON.

http://en.wikipedia.org/wiki/GeoJSON
http://en.wikipedia.org/wiki/TopoJSON
http://bost.ocks.org/mike/topology/
https://github.com/mbostock/topojson/blob/master/README.md

MongoDB supports the GeoJSON format and allows us to easily build location-aware applications. For example, using MongoDB's geospatial query operators, you can:
  • Query for locations contained entirely within a specified polygon.
  • Query for locations that intersect with a specified geometry. We can supply a point, line, or polygon, and any document that intersects with the supplied geometry will be returned.
  • Query for the points nearest to another point.
As you can see, these queries make it very easy to find documents (JSON objects) that are near a given point or that lie within a given polygon.

Ruminating on SSL handshake using Netty

I was again amazed at the simplicity of using Netty for HTTPS (SSL) authentication. The following examples are a good starting point for anyone interested in writing an HTTP server supporting SSL.

http://netty.io/5.0/xref/io/netty/example/http/snoop/package-summary.html

Adding support for 2-way SSL authentication (aka mutual authentication) is also very simple. Here are some hints on how to get this done.

http://stackoverflow.com/questions/9573894/set-up-netty-with-2-way-ssl-handsake-client-and-server-certificate
http://maxrohde.com/2013/09/07/setting-up-ssl-with-netty/

Tuesday, February 18, 2014

Ruminating on IoT Security

The Internet of Things (IoT) is going to be the next big investment area for many organizations as the value of real-time data keeps on increasing in this hyper competitive world.

Cisco's CEO states that IoT is going to be a $19 trillion opportunity. Coca-Cola has reserved a block of 16 million MAC addresses for its smart vending machines. In the Healthcare space, increasing survival rates have resulted in an aging population. With age comes chronic illness, and to address this, Payers and Providers are investing in home-care appliances to monitor patient data - heart/pulse rate, temperature, glucose level, etc.

The next couple of years would see a plethora of smart devices connected to the internet, and this presents a very challenging security problem. Already, poorly secured smart devices have been hacked and compromised: fridges have been used to send spam messages, and smart TVs have been caught spying on viewers' home networks.

One of the fundamental non-technical challenges is that manufacturers of IoT devices have very little incentive to invest in patching old sensors/devices for security. The supply chain runs from the MCU (micro-controller) manufacturer to the OEM, and these folks are busy releasing new versions of their products and do not have the time and energy to patch their old products for security. A very good article describing this is available here.

On the technology front, the challenge is that sensor devices have limited resources for implementing industry-standard encryption techniques. A typical sensor (MCU - Micro Controller Unit) may have just a 32MHz processor and 256KB of memory. Also, most sensor systems are proprietary and closed, making it difficult to patch security updates in an open way.

There is no easy solution to the above problem. The only hope is that the newer MCUs developed for IoT devices have enough processing power to support digital certificates. Using digital certificates enables us to satisfy all facets of enterprise security: a) Authentication, b) Authorization, c) Integrity, d) Confidentiality, e) Non-Repudiation.
Verizon has a comprehensive suite of products for enabling IoT security using PKI/digital certificate technologies. 

Monday, February 17, 2014

How are hash collisions handled in a HashMap/HashTable?

In my previous post, I had explained how the HashMap is the basic data structure used in many popular big data stores, including Hadoop HBase. But how is a HashMap actually implemented?

The following links give a very good explanation of the internal working of a HashMap.
http://java.dzone.com/articles/hashmap-internal
http://howtodoinjava.com/2012/10/09/how-hashmap-works-in-java/

Whenever a key/value pair is added to the HashMap, the key is hashed. The hash of the key determines which 'bucket' (array index) the value is stored in.
So what happens if there is a hash collision? The answer is that each bucket is actually implemented as a LinkedList, and the values are stored in that list. So for keys with the same hash, a linear search is done within the bucket's LinkedList.
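A toy experiment makes this easy to verify: give a key class a constant hashCode() so that every entry lands in the same bucket. The map still returns the right values, because equals() disambiguates keys within the bucket's list (the class names below are my own):

```java
import java.util.HashMap;
import java.util.Map;

public class CollisionDemo {
    // A deliberately bad key: every instance hashes to the same bucket,
    // so all entries collide and lookups degrade to a linear scan.
    static final class BadKey {
        final String name;
        BadKey(String name) { this.name = name; }
        @Override public int hashCode() { return 42; }  // constant hash -> guaranteed collisions
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).name.equals(name);
        }
    }

    public static String lookup(String key) {
        Map<BadKey, String> map = new HashMap<>();
        map.put(new BadKey("a"), "alpha");
        map.put(new BadKey("b"), "beta");
        map.put(new BadKey("c"), "gamma");
        // Despite three colliding keys, equals() resolves the right entry.
        return map.get(new BadKey(key));
    }

    public static void main(String[] args) {
        System.out.println(lookup("b")); // prints "beta"
    }
}
```

The lookups still work, they just cost a linear scan of the bucket instead of a constant-time probe - which is why a well-distributed hashCode() matters.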

HashSet vs HashMap

Today, one of my developers was using a HashMap to store a white-list of IP addresses. He was storing the IP address as both the key and value.
I told him about the HashSet class, which can also be used to store a white-list of elements; like HashMap, a HashSet provides O(1) time complexity for its contains() operation.

We were a bit intrigued about how a HashSet actually works, so we looked into the source code. Much to our surprise, a HashSet actually uses a HashMap internally. Every item we add to the set is added as a key to the HashMap, with a dummy object as the value, as shown below.

// Dummy value to associate with an Object in the backing Map
private static final Object PRESENT = new Object();

In theory, a Set and a Map are separate concepts: a set cannot contain duplicates and is unordered, whereas a Map is a collection of key/value pairs.

Besides HashSet, there are 2 more useful Set implementations. The first is TreeSet, which sorts the elements as you store them in the collection. The second is LinkedHashSet, which maintains insertion order.
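A quick illustration of those ordering guarantees - and of duplicates being silently ignored:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class SetOrderingDemo {
    // Adds the same elements (including a duplicate) to any Set and
    // returns the resulting iteration order as a List.
    public static List<String> inserted(Set<String> set) {
        set.add("banana");
        set.add("apple");
        set.add("cherry");
        set.add("apple"); // duplicate, silently ignored by every Set
        return new ArrayList<>(set);
    }

    public static void main(String[] args) {
        // LinkedHashSet keeps insertion order; TreeSet keeps sorted order.
        System.out.println(inserted(new LinkedHashSet<>())); // [banana, apple, cherry]
        System.out.println(inserted(new TreeSet<>()));       // [apple, banana, cherry]
    }
}
```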

Wednesday, February 12, 2014

Ruminating on Star Schema

Sharing a good YouTube video on the basic concepts of a Star Schema. Definitely worth a perusal for anyone wanting a primer on DW schemas.



The data in the dimension tables can also change, albeit less frequently. For example, a state changes its name, or a customer changes his last name. These are known as 'slowly changing dimensions'. To support slowly changing dimensions, we need to add timestamp columns to the dimension table. A good video explaining these concepts is available below:


Ruminating on Column Oriented Data Stores

There is some confusion in the market regarding column-oriented (aka column-based) databases. Due to the popularity of NoSQL stores such as HBase and Cassandra, many folks assume that column-oriented stores are only for unstructured big data. The fact is that column-oriented tables are available in both SQL (RDBMS) and NoSQL stores.

Let's look at SQL RDBMS first to understand the difference between row-based and column-based tables.
First and foremost, it is important to understand that whether the RDBMS is column or row-oriented is a physical storage implementation detail of the database. There is NO difference in the way we query against these databases using SQL or MDX (for multi-dimension querying on cubes).

In a row-based table, all the attributes (fields) of a row are stored side by side in a linear fashion. In a column-based table, all the data in a particular column is stored side by side. An excellent overview of this concept, with illustrative examples, is given on Wikipedia. The following illustration should clear up the concept easily.


For any query, the most expensive operations are hard disk seeks. As you can see from the above diagram, certain queries would run faster based on the type of table storage. Column-oriented tables are more efficient when aggregates (or other calculations) need to be computed over many rows, but only for a notably smaller subset of all columns of data. Hence column-oriented databases are popular for OLAP workloads, whereas row-oriented databases are popular for OLTP workloads.
It is interesting to note that with the advent of in-memory databases, disk seek time is no longer a constraint. But large data warehouses contain petabytes of data, and all of it cannot be kept in memory.
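The layout difference can be sketched in plain Java as an analogy (this is an illustration, not an actual database engine): a row store resembles a list of records, while a column store resembles one array per column, so an aggregate over a single column touches only one contiguous block of memory.

```java
import java.util.*;

public class RowVsColumnDemo {
    // Row-oriented layout: each record's fields are stored side by side
    record SaleRow(int year, String region, double amount) {}

    public static void main(String[] args) {
        // Row store (shown only to contrast the layouts)
        List<SaleRow> rowStore = List.of(
            new SaleRow(2013, "EAST", 100.0),
            new SaleRow(2013, "WEST", 250.0),
            new SaleRow(2014, "EAST", 175.0));

        // Column-oriented layout: each column is stored contiguously
        int[]    yearCol   = {2013, 2013, 2014};
        String[] regionCol = {"EAST", "WEST", "EAST"};
        double[] amountCol = {100.0, 250.0, 175.0};

        // An aggregate like SUM(amount) only needs the amount column;
        // the columnar layout reads just that one contiguous block.
        double total = Arrays.stream(amountCol).sum();
        System.out.println(total); // 525.0
    }
}
```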

A few years back, the world was divided between column-based and row-based databases. But slowly, almost all database vendors are offering the flexibility to store data in either row-based or column-based tables within the same database, e.g. SAP HANA and Oracle 12c. All these databases are also adding in-memory capabilities, thus boosting performance many-fold.

In the case of NoSQL stores, the concept of column orientation is more or less the same. The difference is that stores such as HBase allow each row to have a different number of columns. More information on the internal schema of HBase can be found here.

Ruminating on BSON vs JSON

MongoDB uses BSON as the default format to store documents in collections. BSON is a binary-encoded serialization of JSON-like documents. But what are the advantages of BSON over the ubiquitous JSON format?

In terms of space (memory footprint), BSON is not necessarily more efficient than JSON. For example, integers are stored as 32-bit (or 64-bit) integers, so they do not need to be parsed to and from text; this is faster, but uses more space than JSON for small integers.

The primary goal of BSON is to enable very fast traversability, i.e. faster processing. BSON adds extra information to documents, e.g. length prefixes, that makes them easy and fast to traverse. Due to this, a BSON parser can move through documents at blinding speed.

Thus BSON sacrifices space efficiency for processing efficiency (i.e. traversability). This is a fair trade-off for MongoDB, where speed is the primary concern.
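To illustrate why length prefixes make traversal fast, here is a toy length-prefixed encoding in Java (this is NOT the actual BSON wire format, just a sketch of the idea): to reach a later value, the parser jumps over earlier ones using their length prefixes instead of scanning character by character, as a JSON parser must.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class LengthPrefixDemo {
    public static void main(String[] args) {
        // Toy encoding: each string value is written as [int32 byteLength][bytes]
        ByteBuffer buf = ByteBuffer.allocate(64);
        for (String value : new String[] {"first", "second", "third"}) {
            byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
            buf.putInt(bytes.length);
            buf.put(bytes);
        }
        buf.flip();

        // Skip the first two values in O(1) each, using their length prefixes
        for (int i = 0; i < 2; i++) {
            int len = buf.getInt();
            buf.position(buf.position() + len); // jump straight over the value
        }

        // Read the third value directly
        int len = buf.getInt();
        byte[] third = new byte[len];
        buf.get(third);
        System.out.println(new String(third, StandardCharsets.UTF_8)); // third
    }
}
```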

Ruminating on HBase datastore

While explaining the concept of HBase to my colleagues, I have observed that folks who do NOT carry the baggage of traditional RDBMS/DW knowledge understand the fundamental concepts of HBase much faster than others. Application developers have been using data structures (collections) such as HashTable and HashMap for decades, and hence they are better able to grasp HBase concepts.

Part of the confusion is due to the terminology used in describing HBase. It is categorized as a column-oriented NoSQL store, different from a row-oriented traditional RDBMS. There are tons of articles that describe HBase as a data structure containing rows, column families, columns, cells, etc. In reality, HBase is nothing but a giant multidimensional sorted HashMap that is distributed across nodes. A HashMap consists of a set of key/value pairs; a multidimensional HashMap is one whose 'values' are other HashMaps.

Each row in HBase is actually a key/value map. This map can have any number of keys (known as columns), each of which has a value. This 'value' can again be a HashMap holding the version history, keyed by timestamp.

Each row in HBase can also store multiple key/value maps. In such cases, each key/value map is called a 'column family'. Each column family is typically stored in different physical files or on different disks. This concept was introduced to support use-cases where you have two sets of data for the same row that are not generally accessed together.
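This 'map of maps' view can be sketched directly with nested sorted maps in Java. This is an illustration of the conceptual model only, not HBase's actual implementation; the row keys, families and columns are made up:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public class HBaseAsMapDemo {
    // Conceptually: rowKey -> columnFamily -> columnQualifier -> timestamp -> value
    static NavigableMap<String, NavigableMap<String, NavigableMap<String,
            NavigableMap<Long, String>>>> table = new TreeMap<>();

    static void put(String row, String family, String col, long ts, String value) {
        table.computeIfAbsent(row, r -> new TreeMap<>())
             .computeIfAbsent(family, f -> new TreeMap<>())
             .computeIfAbsent(col, c -> new TreeMap<>())
             .put(ts, value);
    }

    public static void main(String[] args) {
        put("row1", "personal", "name",  1L, "John");
        put("row1", "personal", "name",  2L, "Johnny"); // newer version, higher timestamp
        put("row1", "contact",  "email", 1L, "john@example.com");

        // The latest version of row1 : personal : name is the highest timestamp
        String latest = table.get("row1").get("personal").get("name")
                             .lastEntry().getValue();
        System.out.println(latest); // Johnny
    }
}
```

Because every level is a sorted map (TreeMap), row keys come back in sorted order, which mirrors how HBase keeps rows sorted by row key.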

The following article gives a good overview of the concepts we just discussed with JSON examples.
http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable

Once we have a solid understanding of HBase concepts, it is useful to look at some common schema examples or use-cases where HBase datastore would be valuable. Link: http://hbase.apache.org/book/schema.casestudies.html

Looking at the use-cases in the above link, it is easy to understand that a lot of thought needs to go into designing an HBase schema. There are multiple ways to design an HBase schema, and the right one depends on the primary use-case of how the data is going to be accessed. This is in total contrast to traditional data warehouses, where applications accessing the warehouse are free to define their own use-cases, and the warehouse supports almost all of them within reasonable limits.

The following articles throw more light on the design constraints that should be considered while using HBase.
http://www.chrispwood.net/2013/09/hbase-schema-design-fundamentals.html
http://ianvarley.com/coding/HBaseSchema_HBaseCon2012.pdf

For folks who do not have a programming background, but want to understand HBase in terms of table concepts, the following YouTube video would be useful.
http://www.youtube.com/watch?v=IumVWII3fRQ

Tuesday, February 11, 2014

Examples of Real Time Analytics in Healthcare

The Healthcare industry has always striven to reduce the cost of care and improve the quality of care. To do this, Payers are focusing more on prevention and wellness of members.

Digitization of clinical information and real-time decision making are important IT capabilities required today. Given below are some examples of real-time analytics in Healthcare.

  1. Hospital-acquired infections are very dangerous for premature infants. Monitors can detect patterns in infected premature babies up to 24 hours before any symptoms appear. This real-time data can be captured and run through a real-time analytics engine to identify such cases and ensure that adequate treatment is given ASAP. 
  2. Use real-time event processing as an early-warning system based on historical patterns or trends. For example, if a few members are exhibiting behavior patterns of prior patients who relapsed into critical condition, we can plan a targeted intervention for these members.  

Ruminating on Decision Trees

Decision trees are tree-like structures that can be used for decision making, classification of data, etc.
The following simple example (on the IBM SPSS Modeler Infocenter Site) shows a decision tree for making a car purchase.

Another example of a decision tree that can be used for classification is shown below. These diagrams are taken from the article available at www.cse.msu.edu/~cse802/DecisionTrees.pdf



Any tree with a branching factor of 2 (i.e. only two branches at each node) is called a "binary decision tree". A tree with a variety of branching factors can be represented as an equivalent binary tree. For example, the below binary tree will evaluate to the same result as the first tree.


It is easy to see that such decision tree models can help us in segmentation, e.g. segmenting patients into high-risk and low-risk categories, or classifying credit as high-risk vs. low-risk.
An excellent whitepaper on Decision Trees by SAS is available here.
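A hand-built binary decision tree for such segmentation can be sketched in a few lines of Java. The attributes ('age', systolic 'bp') and thresholds below are purely illustrative, not taken from any real model (requires Java 16+ for records and pattern matching):

```java
import java.util.Map;

public class DecisionTreeDemo {
    // A node is either a leaf (a segment label) or a binary split on an attribute
    interface Node {}
    record Leaf(String label) implements Node {}
    record Split(String attribute, double threshold, Node left, Node right) implements Node {}

    // Walk the tree: go left when the tested value is below the threshold
    static String classify(Node node, Map<String, Double> patient) {
        while (node instanceof Split s) {
            double v = patient.get(s.attribute());
            node = (v < s.threshold()) ? s.left() : s.right();
        }
        return ((Leaf) node).label();
    }

    public static void main(String[] args) {
        // age < 50 ? low-risk : (bp < 140 ? low-risk : high-risk)
        Node tree = new Split("age", 50,
            new Leaf("low-risk"),
            new Split("bp", 140, new Leaf("low-risk"), new Leaf("high-risk")));

        System.out.println(classify(tree, Map.of("age", 62.0, "bp", 150.0))); // high-risk
        System.out.println(classify(tree, Map.of("age", 35.0, "bp", 150.0))); // low-risk
    }
}
```

Decision tree learning automates exactly this construction: the algorithm picks the attributes and thresholds for each split from the data, rather than a human hand-coding them.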

Decision trees can also be used in predictive modeling; this is known as Decision Tree Learning or Decision Tree Induction. Other names for such tree models are classification trees or regression trees, aka Classification And Regression Trees (CART).
Essentially, "Decision Tree Learning" is a data mining technique in which a decision tree is constructed by slicing and dicing the data using statistical algorithms. Decision trees are produced by algorithms that identify various ways of splitting a data set into branch-like segments.
For example, Wikipedia has a good example of a decision tree that was constructed by looking at the historical data of Titanic survivors.
Decision Tree constructed through Data Mining of Titanic passengers.
Once such a decision tree model has been created, it can be exported as a standard PMML file. This PMML file can then be used in a real time scoring engine such as JPMML.

There is another open-source project called 'OpenScoring' that uses JPMML behind the scenes and provides a REST API to score data against our model. A simple example (with probability prediction mode) for identifying a flower based on its attributes is illustrated here: https://github.com/jpmml/openscoring 

Decision Trees can also be modeled in Rule Engines. The IBM iLog BRMS suite (WODM) supports modeling rules as a Decision Tree.