Wednesday, August 05, 2015

Ruminating on Data Lake

Anyone trying to understand what a Data Lake is should peruse the wonderful article by Martin Fowler on the topic -

Jotting down important points from the article -

  1. Traditional data warehouses (data marts) have a fixed schema - it could be a star schema or a snowflake schema. But a fixed schema imposes many restrictions on data analysis. A Data Lake is essentially schema-less. 
  2. Data warehouses also typically cleanse the incoming data to improve data quality, and aggregate data for faster reporting. In contrast, a Data Lake stores raw data from source systems; it is up to the data scientist to extract the data and make sense of it. 
  3. We still need Data Marts - because the data in a data lake is raw, you need a lot of skill to make any sense of it. Relatively few people work in the data lake; as they uncover generally useful views of the data, they can create a number of data marts, each of which has a specific model for a single bounded context. A larger number of downstream users can then treat these lake-shore marts as an authoritative source for that context.
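The schema-on-read idea in points 1-3 can be illustrated with a small Python sketch. The raw records and the "orders by country" lake-shore view below are invented for illustration - the point is that no schema is imposed at write time, and a specific model is built only when someone needs it:

```python
import json

# Raw, heterogeneous events as they might land in a data lake:
# no fixed schema is imposed when the data is written.
raw_records = [
    '{"type": "order", "id": 1, "amount": 250.0, "country": "US"}',
    '{"type": "clickstream", "user": "u42", "page": "/home"}',
    '{"type": "order", "id": 2, "amount": 120.5, "country": "DE"}',
]

def orders_by_country(records):
    """Schema-on-read: a 'lake-shore mart' view built only when needed."""
    totals = {}
    for line in records:
        rec = json.loads(line)
        if rec.get("type") == "order":
            totals[rec["country"]] = totals.get(rec["country"], 0) + rec["amount"]
    return totals

print(orders_by_country(raw_records))  # {'US': 250.0, 'DE': 120.5}
```

A data warehouse would have forced all three records into one cleansed schema up front; the lake keeps them raw and lets each downstream view decide what to extract.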

Monday, July 27, 2015 - A nifty tool

We used to use browser tools such as Firebug to find out 'backend' information about a particular site - e.g. what servers does it run on? What server-side web technology is being used? What web content management tool is being used? etc.

Found a nifty website that gives all this info in the form of a neat table -
A useful tool to have in the arsenal for any web-master. 

Friday, July 24, 2015

Correlation does not imply Causation!

One of the fundamental tenets that any analytics newbie needs to learn is this - Correlation does not imply Causation!

Using statistical techniques, we might find a relationship between two events, but that does not mean that one event causes the other. Jotting down a few amusing examples that I found on the internet.
  • The faster windmills are observed to rotate, the more wind is observed to be. Therefore wind is caused by the rotation of windmills 
  • Sleeping with one's shoes on is strongly correlated with waking up with a headache. Therefore, sleeping with one's shoes on causes headache.
  • As ice cream sales increase, the rate of drowning deaths increases sharply. Therefore, ice cream consumption causes drowning.
  • Since the 1950s, both the atmospheric CO2 level and obesity levels have increased sharply. Hence, atmospheric CO2 causes obesity.
  • The more firemen are sent to a fire, the more damage is done.
  • Children who get tutored get worse grades than children who do not get tutored
  • In the early elementary school years, astrological sign is correlated with IQ, but this correlation weakens with age and disappears by adulthood.
  • My dog is more likely to have an accident in the house if it’s very cold out.
A good site showcasing such spurious correlations is here -
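A hidden confounder is often what produces these spurious correlations. Here is a small Python sketch with made-up monthly data, where temperature drives both ice cream sales and drownings - the two correlate strongly even though neither causes the other:

```python
import math

# Hypothetical monthly data: temperature drives BOTH series,
# so they correlate without any causal link between them.
temperature = [10, 14, 18, 22, 26, 30, 32, 28, 24, 18, 13, 10]
ice_cream = [2 * t + 5 for t in temperature]   # sales rise with heat
drownings = [t // 2 for t in temperature]      # swimming rises with heat

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(round(pearson(ice_cream, drownings), 2))  # very close to 1.0
```

The correlation is near-perfect, yet banning ice cream would obviously not prevent drownings - the confounder (temperature) explains both.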

Thursday, July 23, 2015

Using the Solver add-in in Excel for finding optimal solutions

Today we learned about a nifty tool in Excel that can be used to find optimal solutions to problems. E.g. given a set of constraints, should we make cars or trucks?

The below links give a quick idea of how to use this tool to find optimal solutions and also carry out 'what-if' analysis. You enter the objective, constraint and decision variable cells and let the tool do the magic.
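Outside Excel, the same kind of product-mix problem can be brute-forced in a few lines of Python. The profit figures and constraints below are invented for illustration - Solver uses proper optimization algorithms, but a small integer problem can simply be enumerated:

```python
# Hypothetical product mix: $300 profit per car, $400 per truck.
# Constraints: assembly hours (cars + 2*trucks <= 100) and
# paint-shop hours (3*cars + trucks <= 150).
best_profit, best_mix = 0, (0, 0)
for cars in range(51):        # 3*cars <= 150 caps cars at 50
    for trucks in range(51):  # 2*trucks <= 100 caps trucks at 50
        if cars + 2 * trucks <= 100 and 3 * cars + trucks <= 150:
            profit = 300 * cars + 400 * trucks
            if profit > best_profit:
                best_profit, best_mix = profit, (cars, trucks)

print(best_profit, best_mix)  # 24000 (40, 30)
```

The optimum (40 cars, 30 trucks) uses both constraints exactly - which is what Solver's sensitivity report would flag as the binding constraints.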

Wednesday, July 15, 2015

How can large enterprises compete with new-age digital startups?

Chief Executive magazine recently featured an article by Nitin Rakesh on how large enterprises can compete with digital startups. The article is available at the following link:
Retraining Goliath to face digital David

The article advises large enterprises to capitalize on their strengths - i.e.

a) Utilize financial power to acquire digital competitors - e.g. how Allstate acquired Esurance.
b) Leverage existing brand equity - e.g. how Amex partnered with Walmart to launch Bluebird.
c) Mine existing customer data - leverage customer insights to deliver highly personalized services.
d) If possible, collaborate rather than compete with digital startups.

Thursday, June 25, 2015 - Next Generation Web Crawler

We had used many open source web crawlers in the past, but recently a friend of mine referred me to a cool tool that essentially parses the data on any website and structures it into a table of rows/columns - "Turn web pages into data". This data can be exported as a CSV file, and the tool also provides a REST API to extract the data. This kind of higher abstraction over raw web crawling can be extremely useful for developers.

We can use the 'magic' tool for automatic extraction, or use their free tool to teach it how to extract data. 

Ruminating on Email marketing software

Recently we were looking for a mass email software for a marketing use-case. Jotting down the various online tools/platforms that we are currently evaluating.

  1. Mailjet - Has a free plan for 200 emails/day
  2. MailChimp - Has a free plan for 12000 emails/month
  3. Campaign Monitor
  4. Active Campaign 
  5. Salesforce Marketing Cloud 

APIs in Fleet Management

Fleet Management software is used by fleet owners to manage their moving assets. The software enables them to have a centralized data-store of their vehicle and driver information and also keep maintenance logs (service and repair tracking).

The software also allows us to schedule preventive maintenance activities, monitor fuel efficiency, maintain fuel card records, calculate metrics such as "cost per mile" etc. You can also setup reminders for certification renewals and license expiration.

It was interesting to see Fleetio (a web based fleet management company) roll out an API platform for their fleet management software. Their vision is to become a digital hub for all fleet related stuff and turn their software product into a platform that can be leveraged by partners to create a digital ecosystem.

The API allows customers to seamlessly integrate data in Fleetio with their operational systems in real time - e.g. pulling work orders from the fleet management system and pushing them to accounting software, pushing mileage updates from a bespoke remote application, or integrating driver records with payroll systems. All the tedious importing and exporting of data is gone!

TomTom also has a web based fleet management platform called WEBFLEET that provides an API (Webfleet.connect) for integration. The Fleetlynx platform also has an API to integrate with Payroll and Maintenance systems.

Saturday, June 20, 2015

Ruminating on bimodal IT

Over the past couple of years, Gartner has been evangelizing the concept of bimodal IT to organizations for succeeding in the digital age. A good note by Gartner on the concept is available here.

Mode 1, which refers to the traditional "run the business" model, focuses on stability and reliability.
Mode 2, which covers the typical "change the business" initiatives, focuses on speed, agility, flexibility and the ability to operate under conditions of uncertainty.

Bimodal IT would also need resources with different skills. As an analogy, Mode 1 IT resources would be the marathon runners, whereas Mode 2 IT resources need to be like sprinters. It would be difficult for an IT resource to be both - there is a risk of being relegated to a mid-distance runner, and today's IT does not need mid-distance runners.

Tuesday, June 16, 2015

Ruminating on Section 508 Accessibility standards

In the UX world, you often come across phrases such as "compliance with Section 508". So what exactly is Section 508 and how does it relate to User Experience?

"Section 508" is actually an amendment to the Workforce Rehabilitation Act of 1973 and was signed into law in 1998. This law mandates that all IT assets developed or purchased by Federal Agencies be accessible to people with disabilities. The law states web guidelines that should be followed while designing and developing websites.

It is important to note that Section 508 does not directly apply to private sector web sites or to public sites that are not U.S. Federal agency sites. But there are other forces at play that may compel an organization to make its websites accessible. The ADA (Americans with Disabilities Act), passed way back in 1990, prohibits any organization from discriminating on the basis of disability.
The following link reveals examples of lawsuits filed for violation of the ADA -

Beyond the legal regulations, there are also open initiatives aimed at improving the accessibility of websites. W3C has an initiative named "Web Accessibility Initiative (WAI)" that lays down standards and guidelines for accessibility. There is also a standard for content authoring called - "Web Content Accessibility Guidelines (WCAG)".

The following sites provide good reading material on Accessibility -

Jotting down the high level guidelines that should be followed for accessibility.

  1. A text equivalent for every non-text element shall be provided (e.g., via "alt", "longdesc", or in element content).
  2. Equivalent alternatives for any multimedia presentation shall be synchronized with the presentation. For e.g.  synchronized captions.
  3. Web pages shall be designed so that all information conveyed with color is also available without color, for example from context or markup. Color is not used solely to convey important information. Ensure that foreground and background color combinations provide sufficient contrast when viewed by someone having color deficits or when viewed on a black and white screen. 
  4. Documents shall be organized so they are readable without requiring an associated style sheet. If style-sheets are turned off, the document should still be readable. 
  5. Client-side image maps are used instead of server-side image maps. Appropriate alternative text is provided for the image as well as each hot spot area.
  6. Data tables have column and/or row headers appropriately identified (using the element).
  7. Pages shall be designed to avoid causing the screen to flicker with a frequency greater than 2 Hz and lower than 55 Hz. No element on the page flashes at a rate of 2 to 55 cycles per second, thus reducing the risk of optically-induced seizures.
  8. When electronic forms are designed to be completed on-line, the form shall allow people using assistive technology to access the information, field elements, and functionality required for completion and submission of the form, including all directions and cues.
  9. When a timed response is required, the user shall be alerted and given sufficient time to indicate more time is required.
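Some of these guidelines are easy to check programmatically. As a minimal sketch, guideline 1 (a text equivalent for every non-text element) can be verified with Python's standard-library HTML parser - the sample page below is made up:

```python
from html.parser import HTMLParser

class AltTextChecker(HTMLParser):
    """Flags <img> tags that violate guideline 1 (no 'alt' attribute)."""
    def __init__(self):
        super().__init__()
        self.violations = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "img" and "alt" not in attributes:
            self.violations.append(attributes.get("src", "?"))

checker = AltTextChecker()
checker.feed('<img src="logo.png" alt="Company logo"><img src="chart.png">')
print(checker.violations)  # ['chart.png']
```

Real accessibility audits (e.g. against WCAG) cover far more than alt text, but the same parse-and-inspect pattern applies.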

Friday, June 12, 2015

Implementing sliding window aggregations in Apache Storm

My team was working on implementing CEP (Complex Event Processing) capabilities using Apache Storm. We evaluated multiple options for doing so - one option was using a lightweight in-process CEP engine like Esper within a Storm Bolt.

But there was another option of manually implementing CEP-like aggregations (over a sliding window) using Java code. The following links show us how to do so.

Rolling Count Bolt on Github
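The rolling-count idea behind that bolt can be sketched in plain Python. This is a simplified, single-threaded model - a real Storm bolt works on tuples, emits counts downstream, and runs distributed - but the eviction logic is the same:

```python
from collections import deque

class SlidingWindowCounter:
    """Counts events per key over the last `window_seconds` seconds,
    mimicking what a rolling-count bolt does inside Storm."""
    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # (timestamp, key) pairs, oldest first

    def record(self, timestamp, key):
        self.events.append((timestamp, key))

    def counts(self, now):
        # Evict events that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window:
            self.events.popleft()
        result = {}
        for _, key in self.events:
            result[key] = result.get(key, 0) + 1
        return result

counter = SlidingWindowCounter(window_seconds=60)
counter.record(0, "login")
counter.record(50, "login")
counter.record(70, "purchase")
print(counter.counts(now=100))  # {'login': 1, 'purchase': 1}
```

The deque gives O(1) eviction from the front, which is why the same structure shows up in the rolling-count implementations linked above.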

While the above code helps in certain scenarios, it does not provide the flexibility of a CEP engine. We need to understand that CEP engines (like Tibco BE, Esper, StreamInsight) are fundamentally different from Apache Storm, which is more of a highly distributed stream computing platform.

A CEP engine would provide you with SQL like declarative queries and OOTB high level operators like time window, temporal patterns, etc. This brings down the complexity of writing temporal queries and aggregates. CEP engines can also detect patterns in events. But most CEP engines do not support a distributed architecture.

Hence it makes sense to combine CEP with Apache Storm - for e.g. embedding Esper within a Storm Bolt. The following links would serve as good reference -

Monday, June 01, 2015

Ruminating on Shipping Containers and Docker

Today during one of the lectures at IIMB, I was introduced to a book called 'The Box' by Marc Levinson.

The book narrates the story of how the invention of the shipping container completely changed the face of global commerce. A snippet from the book -

"the cost of transporting goods was decisive in determining what products they would make, where they would manufacture and sell them, and whether importing or exporting was worthwhile. Shipping containers didn't just cut costs but rather changed the whole economic landscape. It changed the global consumption patterns, revitalizing industries in decay, and even allowing new industries to take shape."

A nice video explaining the same is available on YouTube -

A similar revolution is happening in the IT landscape by means of a new software container concept called Docker. In fact, the logo of Docker contains an image of shipping containers :)

Docker provides an additional layer of abstraction (through a docker engine, a.k.a docker server) that can run a docker container containing any payload. This has made it really easy to package and deploy applications from one environment to the other.

A Docker container encapsulates all the code and dependencies required to run an application. Containers are quite different from virtualization technology: a hypervisor running on a 'Host OS' essentially loads an entire 'Guest OS' and then runs the apps on top of it. In the Docker architecture, you have a Docker engine (a.k.a. Docker server) running on the Host OS. Each Docker server can host many Docker containers. Docker clients can remotely talk with Docker servers using a REST API to start/stop containers, patch them with new versions of an app, etc.

A good article describing the differences between them is available here -


All docker containers are isolated from each other using the Linux Kernel process isolation features.

In fact, it is these OS-level virtualization features of Linux that has enabled Docker to become so successful.

Other operating systems such as Windows or MacOS do not have such features as part of their core kernels to support Docker. Hence the current way to run Docker on them is to create a lightweight Linux VM (boot2docker) and run Docker within it. A good article explaining how to run Docker on MacOS is here -

Docker was so successful that even Microsoft was forced to admit that it was a force to reckon with!
Microsoft is now working with Docker to enable native support for docker containers in its new Nano server operating system -

This IMHO, is going to be a big game-changer for MS and would catapult the server OS as a strong contender for Cloud infrastructure. 

Ruminating on bare metal cloud environments

Virtualization has been the underpinning technology that powered the Cloud revolution. In a typical virtualized environment, you have the hypervisor (virtualization software) running on the Host OS. These types of hypervisors are called "Type 2 hypervisors".

But there are hypervisors that can be installed directly on the hardware (i.e. bare metal). These hypervisors, known as "Type 1 hypervisors", do not need a host OS to run and have their own device drivers and other software to interact with the hardware components directly. A major advantage of this is that problems in one virtual machine do not affect the other guest operating systems running on the hypervisor.

The below image from Wikipedia gives a good illustration.

Thursday, May 14, 2015

Ruminating on Apple HealthKit backup

While my team was working on the Apple HealthKit iOS APIs, we came to know a few interesting things that many folks are not aware of. Jotting down our findings -
  • HealthKit data is only locally stored on the user's device
  • HealthKit data is not automatically synced to iCloud - even if you have enabled iCloud synching for all apps. 
  • HealthKit data is not backed up as part of a normal device backup in iTunes. So if you restore your device, all HealthKit data would be lost!
  • HealthKit is not available on iPads. 
The only way to take a backup of HealthKit data is to enable "encrypted backup" in iTunes. If this option is selected, then your HealthKit data would get backed up.

Another interesting point from a developer's perspective is that the HealthKit store is encrypted on the phone and is accessible by authorized apps only when the device is unlocked. If the device is locked, no authorized app can access the data during that time. But apps can continue sending data via the iOS APIs. 

Thursday, February 05, 2015

Comparing two columns in excel to find duplicates

Quite often, you have to compare two columns in Excel to find duplicates or 'missing' rows. Though there are many ways to do this, the following MS article gives a simple solution.
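For those comfortable with scripting, the same comparison can be done with Python sets - the sample column values below are invented:

```python
# The comparison Excel formulas do (e.g. COUNTIF/VLOOKUP lookups),
# expressed with Python set operations.
column_a = ["alice", "bob", "carol", "dave"]
column_b = ["bob", "dave", "erin"]

duplicates = sorted(set(column_a) & set(column_b))  # present in both columns
missing = sorted(set(column_a) - set(column_b))     # in A but not in B

print(duplicates)  # ['bob', 'dave']
print(missing)     # ['alice', 'carol']
```

Set intersection and difference map directly onto the "duplicates" and "missing rows" questions, and scale far better than nested formulas on large columns.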

Depreciation of fixed assets in accounting

Would like to recommend the following site that gives a very simple explanation of the concept of depreciation in accounting. Worth a perusal for beginners.
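As a quick illustration of the most common method - straight-line depreciation - here is the formula in Python (the figures are made up):

```python
def straight_line_depreciation(cost, salvage_value, useful_life_years):
    """Annual depreciation expense under the straight-line method:
    spread (cost - salvage value) evenly over the asset's useful life."""
    return (cost - salvage_value) / useful_life_years

# A $12,000 machine with a $2,000 salvage value over 5 years:
annual = straight_line_depreciation(12000, 2000, 5)
print(annual)  # 2000.0 per year
```

Other methods (declining balance, units of production) weight the expense differently across years, but the total depreciated amount stays the same.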

Monday, January 26, 2015

Patient Engagement Framework for Healthcare providers

HIMSS (Healthcare Information and Management Systems Society) has published a good framework for engaging patients so as to improve health outcomes.

Patients want to be engaged in their healthcare decision-making process, and those who are engaged as decision-makers in their care tend to be healthier and have better outcomes. The whole idea is to treat patients not just as customers, but as partners in their journey towards wellness.

The following link provides a good reference for designing technology building blocks for improving patient experience.

Inform Me --- Engage Me --- Empower Me --- Partner with me

Ruminating on Open Graph Protocol

Ever wondered how some links on Facebook are shown with an image and a brief paragraph? I dug deeper to understand what Facebook was doing behind the scenes to visualize the link.

To my surprise, there is something called the "Open Graph Protocol" that defines a set of rules for telling Facebook how your shared content should be displayed.

E.g. we can add the following meta-tags in any web page, and Facebook will parse these tags when you post a link to the page.

  • <meta property="og:title" content=""/>
  • <meta property="og:type" content=""/>
  • <meta property="og:url" content=""/>
  • <meta property="og:image" content=""/>
  • <meta property="fb:admins" content=""/>
  • <meta property="og:site_name" content=""/>
  • <meta property="og:description" content=""/>

More information can be found at this link -
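Reading these tags back out is straightforward. Here is a minimal sketch of what a link-preview generator does, using Python's standard HTML parser (the sample page is made up):

```python
from html.parser import HTMLParser

class OpenGraphParser(HTMLParser):
    """Collects og:* meta tags the way a link-preview generator would."""
    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("property", "").startswith("og:"):
            self.og[attributes["property"]] = attributes.get("content", "")

page = '''<head>
<meta property="og:title" content="My Article"/>
<meta property="og:image" content="http://example.com/pic.jpg"/>
</head>'''

parser = OpenGraphParser()
parser.feed(page)
print(parser.og["og:title"])  # My Article
```

Facebook's crawler does essentially this when you share a URL: fetch the page, read the og:title, og:image and og:description tags, and render the preview card from them.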

    Router blocking HTTPS traffic?

    Recently I got a new cloud router for my broadband connection. Though the speed was very good, I was facing intermittent problems in accessing HTTPS sites - e.g. webmail would hang sometimes, payment gateway pages would not load, the Amazon app would not load screens, etc.

    At first, I was not sure whether the router was to blame or the internet connection itself. A quick Google search revealed that this is a common problem with many routers and has to do with the MTU (Maximum Transmission Unit) size limit. I was surprised that the MTU size would affect HTTPS, which is an application-level protocol.

    The following links show an easy method to find out the correct MTU size for your network using the ping command. For e.g. ping -f -l 1472
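The arithmetic behind that 1472 figure: an IPv4 header is 20 bytes and an ICMP header is 8 bytes, so a standard 1500-byte MTU leaves 1472 bytes for the ping payload. A tiny Python sketch of the relationship:

```python
# IPv4 header (20 bytes) + ICMP header (8 bytes) = 28 bytes of overhead,
# so the largest non-fragmenting ping payload is MTU - 28.
HEADER_OVERHEAD = 20 + 8

def max_ping_payload(mtu):
    return mtu - HEADER_OVERHEAD

def mtu_from_payload(payload):
    return payload + HEADER_OVERHEAD

print(max_ping_payload(1500))   # 1472 -> the size used with `ping -f -l`
print(mtu_from_payload(1472))   # 1500
```

So if `ping -f -l 1472` fragments but a smaller payload goes through, you add 28 to the largest working payload to get the MTU value to configure on the router.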

    Thursday, January 15, 2015

    Applying Analytics to Clinical Trials

    The below link is a good article on using Big Data Analytics to improve the efficiency of clinical trials.

    Snippets from the article -

    "Recruiting patients has been a challenge for pharmaceutical companies. 90 per cent of trials are delayed with patient enrollment as a primary cause.  
    Effective target segmentation for enrollment is a key to success. Traditional methods of enrollment rely upon campaign and segmentation based on disease lines across wider populations. Using data science, we can look at the past data to identify proper signals and help planners with more precise and predictive segmentation. 

    Data scientists will look at the key attributes that matter for a given patient to successfully get enrolled. For each disease type, there may be several attributes that matter. For example, a clinical trial that is focused on a new diabetes medication targets populations’ A1C levels, age group, demographics, outreach methods, and site performance. Data science looks at the above attribute values for the target users past enrollment data and then builds ‘patient enrollment propensity’ and ‘dropout propensity’ models. These models can generate multi variant probabilities for predicting future success. 

    In addition to the above modeling, we can identify the target segment’s social media footprint for valuable clues. We can see which outreach methods are working, and which social media channels the ‘generation Googlers’ are using.  Natural language processing (NLP) techniques to understand the target population’s sentiment on clinical trial sites, physicians, and facilities can be generated and coded into a machine understandable form. Influencer segments can be generated from this user base to finely tune campaign methods for improving effectiveness."

    Thursday, January 08, 2015

    Ruminating on the Open Bank Project

    A colleague of mine introduced me to the 'Open Bank Project'. It's an interesting open source project to create a generic API layer on top of various core banking products. Third-party developers would then use these APIs to build cool mobile and social apps.

    The open source project was started by a Berlin-based company named Tesobe. Right now, they seem to have successfully built adapters/connectors to 3 German banks and are planning to add connectors for more banks. A full list is given here -

    A few sample apps have been built using the Open Bank API -

    I think the concept is interesting, but the challenge would be to build the connectors to the various core banking products out there. Will download the APIs from github and evaluate them further. 

    Monday, September 22, 2014

    Exploring Apache Kafka..

    We had successfully used ActiveMQ and RabbitMQ in many projects and never felt the need to explore any other message broker. Today, my colleague introduced me to 'Apache Kafka' and was drooling over the high performance and reliability it provided. Kafka is extensively used within LinkedIn and can be used in many use-cases.

    The following blog post gives a good performance benchmark of Kafka.

    Another good blog post worth reading is:

    Another good tutorial on using Kafka to push messages to Hadoop is available here -
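A large part of what makes Kafka fast and reliable is its core design: each partition is an append-only log, and consumers track their own read offsets instead of the broker deleting delivered messages. The toy Python sketch below models that idea (it is not the real Kafka API - just an illustration of the log-plus-offsets model):

```python
class MiniLog:
    """Toy model of a single Kafka partition: an append-only log where
    each consumer tracks its own offset; messages are never deleted
    on delivery, so many consumers can read independently."""
    def __init__(self):
        self.messages = []
        self.offsets = {}  # consumer name -> next offset to read

    def produce(self, message):
        self.messages.append(message)
        return len(self.messages) - 1  # offset of the new message

    def consume(self, consumer, max_messages=10):
        start = self.offsets.get(consumer, 0)
        batch = self.messages[start:start + max_messages]
        self.offsets[consumer] = start + len(batch)
        return batch

log = MiniLog()
for event in ["page_view", "click", "purchase"]:
    log.produce(event)

print(log.consume("reporting"))  # ['page_view', 'click', 'purchase']
print(log.consume("reporting"))  # [] -- this consumer's offset has advanced
print(log.consume("billing"))    # ['page_view', 'click', 'purchase']
```

Because the "reporting" and "billing" consumers keep independent offsets, each reads the full stream at its own pace - the same property that lets Kafka feed both real-time consumers and batch loads into Hadoop from one log.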

    Thursday, September 11, 2014

    Monitoring TOMEE using VisualVM

    A few years back, we moved from Tomcat to JBoss for our production servers, because there was no viable enterprise support for Tomcat.

    Today, we have viable options such as support from Tomitribe.

    The below article on Tomitribe gives a good overview of setting up VisualVM for monitoring Tomcat.

    Default tools in the JDK

    Found the below article worth a perusal. We get so used to using sophisticated tools that we forget there are things we can do with a bare JDK :)

    Monday, September 01, 2014

    Does Digital Transformation need 'Skunk Works' kind of environment?

    Skunk Works is a term that originated during WWII and is the official alias for Lockheed Martin’s Advanced Development Programs (ADP).

    Today Skunk Works is used to describe a small group of people who work on advanced technology projects. This group is typically given a high degree of autonomy and unhampered by bureaucracy. The primary driver of setting up a Skunk Works team is to develop something quickly with minimal management constraints.
    The term also refers to technology projects developed in semi-secrecy, such as the Google X Lab, or the 50-person team established by Steve Jobs to develop the Macintosh computer.

    For any organization embarking on a Digital Transformation journey, it would be worthwhile to build such a Skunk Works team that can innovate quickly and bring an idea to the required threshold of technology readiness. I have seen so many ideas die under the shackles of bureaucracy and long processes. Having a Skunk Works team operate like a start-up within your organization can do wonders in leap-frogging your competition in the digital age.

    Monday, August 25, 2014

    Ruminating on Showrooming and Webrooming in the Digital Age

    When e-Commerce giants such as Amazon took the retail industry by storm, there was a lot of FUD on showrooming. As a digital native, even I indulged in showrooming before heading out to my favourite e-commerce site to buy the product online.

    But a recent study conducted in the US found that many folks also engage in reverse showrooming (aka webrooming): consumers go online to research products, but then actually go to a bricks-and-mortar store to complete their purchase.

    The following link on Business-Insider throws more details on this phenomenon.

    This report came as a surprise to me and I would assume that retailers are happy about this trend :)
    Retailers are also trying out innovative techniques to capitalize on this trend. Some of them include deploying knowledgeable sales staff who educate the customer and create a superior in-store customer experience. BLE-enabled beacons push personalized offers to the customer's mobile app while he is in the store. m-Wallets enable contact-less and hassle-free payments at the POS.

    Retailers are also embracing BOPiS (Buy Online, Pick Up In Store)! This greatly reduces logistics/shipping costs, as the existing transportation network is used for delivery.

    Popular e-Commerce software vendors such as Hybris have also started catering to this market and have an in-store solution for retailers.

    Friday, August 22, 2014

    A good comparison of BLE and Classic Bluetooth

    The following link gives a good overview of the differences between BLE (Bluetooth low energy) and classic bluetooth. Definitely worth a perusal.

    The fundamental reason why BLE is becoming so popular in beacons is the extremely low power consumption of BLE devices. This makes it possible to power a small device with a tiny coin cell battery for 5–10 years!

    Tuesday, August 12, 2014

    How does Facebook protect its users from malicious URLs?

    The following post gives a good overview of the various techniques (such as link shim) used by Facebook to protect its users from malicious websites - whose links would be embedded in posts.

    Facebook has its internal blacklist of malicious links and also queries external partners such as McAfee, Google, Web of Trust, and Websense.  When FB detects that a URL is malicious, it displays an interstitial page before the browser actually requests the suspicious page. This protects the user, who now has to make a conscious decision as to whether he wants to proceed to the malicious page.

    BTW, if you have not already installed the 'Web of Trust' browser plugin for your browser, do so immediately :)

    Another interesting point was the fact that it is more secure to run a check at click time than at display time. If FB relied on display-time filtering alone, it would not be able to retroactively block malicious URLs lying in an email or an old page.

    Wednesday, July 09, 2014

    Collection of free books from Microsoft

    Eric Ligman has provided links to a large collection of free Microsoft books on a variety of topics on his blog post (link below).

    Some of the books that I found interesting were on Azure Cloud Design Patterns, SharePoint, Office 365, etc.

    Tuesday, June 03, 2014

    Categorization of applications in IT portfolio

    During any portfolio rationalization exercise, we categorize applications based on various facets, as explained in one of my old posts here.

    Interestingly, Gartner has defined three application categories, or "layers," to distinguish application types and help organizations develop more appropriate strategies for each of them.

    Snippets from the Gartner news site -

    Systems of Record — Established packaged applications or legacy homegrown systems that support core transaction processing and manage the organization's critical master data. The rate of change is low, because the processes are well-established and common to most organizations, and often are subject to regulatory requirements.
    Systems of Differentiation — Applications that enable unique company processes or industry-specific capabilities. They have a medium life cycle (one to three years), but need to be reconfigured frequently to accommodate changing business practices or customer requirements.
    Systems of Innovation — New applications that are built on an ad hoc basis to address new business requirements or opportunities. These are typically short life cycle projects (zero to 12 months) using departmental or outside resources and consumer-grade technologies.