Wednesday, August 05, 2015

Ruminating on Data Lake

Anyone contemplating to understand a Data Lake should peruse the wonderful article by Martin Fowler on the topic - http://martinfowler.com/bliki/DataLake.html

Jotting down important points from the article -

  1. Traditional data warehouse (data marts) have a fixed schema - it could be a star schema or a snowflake schema. But having a fixed schema imposes many restrictions for data analysis. A Data Lake is essentially schema-less. 
  2. Data warehouses also typically cleanse the incoming data and improve the data quality. They also aggregate data for faster reporting. In contrast, a Data Lake stores raw data from source systems. It is up-to the data scientist to extract the data and make sense of it. 
  3. We still need Data Marts - Because the data in a data lake is raw, you need a lot of skill to make any sense of it. You have relatively few people who work in the data lake, as they uncover generally useful views of data in the lake, they can create a number of data marts each of which has a specific model for a single bounded context.A larger number of downstream users can then treat these lake-shore marts as an authoritative source for that context.