Showing posts with label Data Architecture.

Wednesday, April 30, 2025

Ruminating on "Zero ETL"

Traditional ETL workflows consist of three separate stages: 

  1. Extracting data from source systems
  2. Transforming it to fit analytical needs
  3. Loading it into a target database or data warehouse. 
Although this method is reliable, it often requires considerable time, resources, and technical effort, which can introduce delays and impede timely decision-making.

In contrast, Zero ETL eliminates these conventional steps by allowing direct access to source data and applying transformations on-the-fly during query execution. This significantly reduces latency, limits unnecessary data movement, and streamlines the integration process. By harnessing modern cloud infrastructure and sophisticated query tools, Zero ETL offers a more efficient, scalable approach to data management.

Examples of Zero ETL:

  • AWS’s Zero ETL solutions, such as the integration between Amazon Aurora and Amazon Redshift, allow organizations to query data across systems without constructing traditional pipelines (a minimal query sketch follows this list).
  • The Snowflake Data Cloud supports federated queries and data sharing, enabling access to data across platforms without ETL processes. 
  • Google Cloud BigQuery Omni facilitates cross-cloud analytics, allowing users to query data residing in AWS, Azure, or Google Cloud Platform without data replication.
  • Airbyte is a popular open-source data integration engine that automates the movement of data from various sources to destinations (data warehouses, lakes, databases) with minimal custom coding. Airbyte offers over 350 pre-built connectors, orchestration features, and robust security, making it suitable for streamlined, scalable data integration without heavy ETL pipelines.
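
To make "no pipeline" a little more concrete, below is a hedged, hypothetical sketch of what consumption can look like once something like an Aurora-to-Redshift zero-ETL integration is in place: the analyst simply connects to Redshift (which speaks the PostgreSQL wire protocol, so psycopg2 works) and queries the replicated tables directly. All hosts, credentials, and schema/table names are placeholders.

```python
# Hypothetical sketch: querying Aurora data that a zero-ETL integration has
# already replicated into Amazon Redshift. There is no pipeline code here --
# we connect to Redshift and query the replicated schema as if it were local.
import psycopg2  # Redshift is PostgreSQL-wire compatible

conn = psycopg2.connect(
    host="my-cluster.example.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439,
    dbname="analytics",      # placeholder
    user="analyst",          # placeholder
    password="********",
)

with conn, conn.cursor() as cur:
    # 'aurora_orders' is assumed to be the database created by the
    # Aurora -> Redshift zero-ETL integration (name is illustrative).
    cur.execute("""
        SELECT customer_id, SUM(order_total) AS lifetime_value
        FROM aurora_orders.public.orders
        GROUP BY customer_id
        ORDER BY lifetime_value DESC
        LIMIT 10;
    """)
    for customer_id, lifetime_value in cur.fetchall():
        print(customer_id, lifetime_value)
```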

Wednesday, March 19, 2025

Ruminating on Data Mesh and Data-as-a-Product

In today’s fast-paced, data-driven world, organizations are constantly looking for better ways to handle the massive amounts of data they generate and use. Traditional approaches frequently rely on a single, centralized team to manage all the data—like a gigantic control center managing the central data warehouse or data lake. But as firms grow and data becomes more complex, this centralized approach can become slow, inefficient, and hard to scale. 

Data Mesh is a fresh and innovative strategy that’s altering how organizations think about data. Data Mesh is a decentralized way to manage data. Instead of one team being in charge of everything, Data Mesh spreads the responsibility across different groups—or "domains"—within the organization. A department such as product development, sales, or marketing could be considered a domain. Each domain owns its own data, meaning they collect it, store it, maintain its quality, and make it available to others. 

Data Mesh is built on a few key ideas:

  • Domain-Driven Ownership: Each team takes full control of the data tied to their area of work. For example, the sales team manages sales data, while the customer support team handles support-related data.
  • Self-Service Platform: Domains get access to platforms and tooling that let them manage their data independently, without always needing help from a central IT team.
  • Federated Governance: Even though data is managed separately, company-wide standards make sure everything connects smoothly and stays secure.
  • Data as a Product: One of the standout ideas in Data Mesh is treating data as a product. This concept is borrowed from how companies build software products—with a focus on making them user-friendly, reliable, and well-supported. In a Data Mesh, each domain doesn’t just store data, they polish it up and package it like a product that others in the organization can easily use.
For example, the marketing team might build a ready-to-use "customer behavior data product" and make it easily accessible via an API. This allows other departments, such as product design and leadership, to directly use reliable, well-organized data without needing to process raw data or request assistance.
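
As an illustrative sketch only (not a prescribed implementation), here is one way a domain team might describe and publish such a data product in Python. Every field name, owner, and SLA value below is hypothetical.

```python
# Illustrative sketch: a domain team packaging data as a discoverable "product"
# with an owner, a versioned schema, a freshness SLA, and a read interface.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class DataProduct:
    name: str                       # discoverable product name
    owner_domain: str               # accountable team, e.g. "marketing"
    schema_version: str             # consumers can pin to a version
    freshness_sla_hours: int        # how stale the data is allowed to get
    records: list[dict[str, Any]] = field(default_factory=list)

    def read(self) -> list[dict[str, Any]]:
        """The published, consumer-facing interface to the data."""
        return list(self.records)

# The marketing domain publishes a customer-behaviour product...
customer_behaviour = DataProduct(
    name="customer_behaviour",
    owner_domain="marketing",
    schema_version="1.2.0",
    freshness_sla_hours=24,
    records=[{"customer_id": 42, "visits_last_30d": 7, "churn_risk": "low"}],
)

# ...and other domains (product design, leadership dashboards) consume it
# through the interface rather than by digging into raw tables.
for row in customer_behaviour.read():
    print(row)
```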

Benefits of Data Mesh Architecture: Giving teams ownership of their data promotes scalability, allowing them to manage their own data needs as the company expands. It also accelerates workflows, as teams can independently develop and share data products. This ownership drives higher data quality, as teams rely on its accuracy, and provides the flexibility to adjust data to changing demands, leading to a more responsive organization.

Potential Challenges: While decentralization offers benefits, it necessitates careful coordination to prevent data silos and ensure interoperability. Managing numerous independent data products adds complexity, requiring teams to have adequate technical resources and skills. Robust governance is also crucial to avoid data duplication and security breaches.


For example, by adopting Data Mesh, an online retailer lets teams own their data: the product team manages catalogs, customer service handles reviews, and logistics oversees shipping. These "data products" are then easily accessible to other teams, for example to power a live sales dashboard, without needing a central data team, while shared standards maintain consistency.
Data Mesh is a mindset shift, not just a technology, that empowers teams to own their data as products and unlock its full potential, especially in large companies. Though it requires upfront setup effort, it leads to faster, smarter, and more adaptable data use.

Tuesday, January 21, 2025

Ruminating on Standardizing Data

In the realm of statistics, we frequently face datasets of varied sizes and units. This might make it difficult to compare variables or use specific statistical approaches. To solve this challenge, we use a strong approach known as standardization. 

Essentially, standardization transforms our original data into a new dataset where:

  • Mean: The average value of the new dataset is 0. 
  • Standard Deviation: The measure of data dispersion around the mean is 1.
This process is also known as "z-score transformation".

Below are the advantages of standardizing data: 

  • Comparability: Standardized data enables direct comparison of variables recorded on various scales. For example, heights in meters can be compared to weights in kilos.
  • Model Development: Standardized data improves the performance of many statistical models, including regression and machine learning methods. This increases the model's accuracy and stability.
  • Outlier Detection: When data is standardized, it is easier to identify values that deviate considerably from the norm.

The formula for standardizing a data point (x) is: 

z (standard value) = (x - μ) / σ, where μ is the mean and σ is the standard deviation

Example:

  • Original data: 150, 160, 170, 180, 190
  • Mean (μ) = 170, Standard Deviation (σ) ≈ 15.8 (sample standard deviation)
  • Standardized data: -1.27, -0.63, 0, 0.63, 1.27
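
To make the calculation concrete, here is a small sketch in plain Python (standard library only) that reproduces the example above. Note that the sample standard deviation (divide by n - 1) is used, matching the σ ≈ 15.8 quoted above; tiny differences such as 1.26 vs 1.27 come from rounding σ before dividing.

```python
# Standardization (z-score transformation) of the example dataset.
data = [150, 160, 170, 180, 190]

mean = sum(data) / len(data)
variance = sum((x - mean) ** 2 for x in data) / (len(data) - 1)  # sample variance
std_dev = variance ** 0.5                                        # ~15.81

z_scores = [round((x - mean) / std_dev, 2) for x in data]
print(mean, round(std_dev, 1))   # 170.0 15.8
print(z_scores)                  # [-1.26, -0.63, 0.0, 0.63, 1.26]
```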

Standardizing data is a fundamental technique in statistics and data science. By transforming data to have a mean of 0 and a standard deviation of 1, we gain valuable insights and improve the performance of various statistical analyses.

Friday, November 03, 2023

Ruminating on Debezium CDC

Debezium is a distributed open source platform for change data capture (CDC). It collects real-time changes to database tables and transmits them to other applications. Debezium is developed on top of Apache Kafka, which provides a dependable and scalable streaming data infrastructure.

Debezium operates by connecting to a database and watching for table updates. When a change is identified, Debezium creates a Kafka event with the change's information. Other applications, such as data pipelines, microservices, and analytics systems, can then ingest these events.



There are several benefits of utilising Debezium CDC, including:

  • Debezium feeds updates to database tables in near real time, allowing other applications to react to changes almost instantly.
  • Debezium is built on Apache Kafka, which provides a dependable and scalable streaming data platform.
  • Debezium can stream updates from a number of databases, including MySQL, PostgreSQL, Oracle, and Cassandra, using connectors. 
  • Debezium is simple to install and operate. It has connectors for major databases and may be deployed on a number of platforms, including Kubernetes/Docker.
Use cases for Debezium CDC:
  • Data pipelines and real-time analytics: Debezium can be used to create data pipelines that stream changes from databases to other data systems, such as data warehouses, data lakes, and analytics systems.  For example, you could use Debezium to stream changes from a MySQL database to Apache Spark Streaming. Apache Spark Streaming can then process the events and generate real-time analytics, such as dashboards and reports.
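
To make this concrete, here is a hedged sketch of a downstream Python consumer reading Debezium change events off Kafka with the kafka-python library. The topic name, broker address, and field handling assume Debezium's default JSON event envelope (a "payload" containing "before", "after", and "op"); all names are placeholders.

```python
# Hypothetical consumer of Debezium change events from a Kafka topic.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "inventory.public.orders",      # placeholder topic: <prefix>.<schema>.<table>
    bootstrap_servers=["localhost:9092"],
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")) if v else None,
)

for message in consumer:
    event = message.value
    if event is None:               # tombstone records carry an empty value
        continue
    payload = event.get("payload", {})
    op = payload.get("op")          # "c"=create, "u"=update, "d"=delete, "r"=snapshot read
    print(op, payload.get("before"), payload.get("after"))
```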

Wednesday, July 20, 2022

Ruminating on Data Lakehouse

In my previous blog posts, we discussed Data Lakes and the Snowflake architecture.

Since Snowflake combines the capabilities of a traditional data warehouse and a data lake, it also markets itself as a Data Lakehouse.

A competing open-source alternative, backed by Databricks, is Delta Lake. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing on top of existing data lakes.
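
As an illustration (not an official example), here is a minimal PySpark sketch of what Delta Lake adds on top of a plain data lake: transactionally committed writes and ACID merges (upserts). It assumes a Spark session configured with the delta-spark package; the table path and column names are placeholders.

```python
# Minimal Delta Lake sketch: batch write plus an ACID merge (upsert).
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-demo").getOrCreate()  # assumes Delta configs
path = "/tmp/events_delta"   # placeholder location on the data lake

# Batch write -- creates a transactionally committed Delta table.
df = spark.createDataFrame([(1, "click"), (2, "view")], ["event_id", "event_type"])
df.write.format("delta").mode("overwrite").save(path)

# ACID merge -- the kind of in-place upsert plain Parquet files cannot do safely.
updates = spark.createDataFrame([(2, "purchase"), (3, "click")], ["event_id", "event_type"])
DeltaTable.forPath(spark, path).alias("t").merge(
    updates.alias("u"), "t.event_id = u.event_id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()

spark.read.format("delta").load(path).show()
```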

A good comparison between Snowflake and databricks Delta Lake is available here: https://www.firebolt.io/blog/snowflake-vs-databricks-vs-firebolt

Enterprises that are embarking on their data platform modernization strategy can ask the following questions to arrive at a best-fit choice:

  • Does the data platform have separation of compute from storage? This will enable it to scale horizontally as your data volumes and processing needs increase.
  • Does the data platform support cloud-native storage? Most of the cloud-native storage services from the hyperscalers (e.g. AWS S3, Google Cloud Storage, Azure Data Lake Storage) have been battle-tested for scalability. 
  • What use cases do you want to run on your data platform? e.g. Reporting/BI, streaming data analytics, low-latency dashboards, etc. 

Wednesday, June 01, 2022

Ruminating on Data Gravity

Data Gravity is a concept that states that as more and more data accumulates (analogous to mass increasing), applications and services are increasingly attracted towards this data. 
Two factors accelerate this: latency and throughput. The closer the app/service is to the data, the lower the latency and the higher the throughput. 

Another facet of data gravity is that as large volumes of data get concentrated in a datastore, it becomes extremely difficult to move that data (to the cloud or to another DW).
Hyperscalers understand the concept of data gravity very well and hence are actively pushing for data to move to the cloud. As more and more data gets stored in cloud DWs, NoSQL stores, and RDBMSs, it becomes easier for developers to leverage hyperscaler PaaS services to build apps/APIs that exploit this data. 

Saturday, September 08, 2018

Batch ETL to Stream Processing

Many of our customers are moving their traditional ETL jobs to real-time stream processing.
The following article is an excellent read on why Kafka is a great choice for unified batch processing and stream processing.

https://www.infoq.com/articles/batch-etl-streams-kafka

Snippets from the article:

  • Several recent data trends are driving a dramatic change in the old-world batch Extract-Transform-Load (ETL) architecture: data platforms operate at company-wide scale; there are many more types of data sources; and stream data is increasingly ubiquitous
  • Enterprise Application Integration (EAI) was an early take on real-time ETL, but the technologies used were often not scalable. This led to a difficult choice with data integration in the old world: real-time but not scalable, or scalable but batch.
  • Apache Kafka is an open source streaming platform that was developed seven years ago within LinkedIn.
  • Kafka enables the building of streaming data pipelines from “source” to “sink” through the Kafka Connect API and the Kafka Streams API.
  • Logs unify batch and stream processing. A log can be consumed via batched “windows”, or in real time by examining each element as it arrives.

Wednesday, August 16, 2017

Cool open source tool for ER diagrams

Recently one of my colleagues introduced me to SchemaSpy, a Java tool that can be used to create beautiful database documentation and ER diagrams.

Would highly recommend perusing the following links and utilizing this tool:

Monday, May 30, 2016

Ruminating on IoT datastores

The most popular data-store choices for storing high volumes of IoT sensor data are NoSQL time-series databases. 

The following link contains a good list of NoSQL time-series databases that can be used in an IoT project. We have worked with both OpenTSDB and KairosDB and found both of them to be enterprise grade. 



Thursday, February 25, 2016

Ruminating on EMPI

EMPI (Enterprise Master Patient Index) is a consolidated hub of all patient-related information that acts as a single source of truth.

Hospitals have various departmental systems such as lab systems, radiology systems, EMR systems and other Health Information systems that operate in isolation. Typically patient data is spread out across these disparate systems and it is challenging to have a 360-degree view of the patient.

Hence, hospitals create an EMPI hub that assigns a unique ID to each patient. EMPI systems use algorithms to match and link records across disparate systems. The algorithms also identify duplicate records and reduce the number of false negatives. The typical attributes used by the matching algorithms are first name, last name, DOB, sex, social security number, address and more. The matching algorithms (deterministic, probabilistic/fuzzy) must consider typos, misspellings, transpositions, aliases, etc.
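
As a toy illustration of the matching idea (not a production EMPI engine), here is a small Python sketch that scores two patient records using weighted string similarity from the standard library. The attributes, weights, and threshold are hypothetical and would be tuned per organization.

```python
# Naive probabilistic-style patient matching using weighted string similarity.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def match_score(rec_a: dict, rec_b: dict) -> float:
    weights = {"first_name": 0.25, "last_name": 0.35, "dob": 0.30, "zip": 0.10}
    return sum(w * similarity(rec_a.get(f, ""), rec_b.get(f, ""))
               for f, w in weights.items())

lab = {"first_name": "Jon", "last_name": "Smith", "dob": "1980-02-11", "zip": "60601"}
emr = {"first_name": "John", "last_name": "Smyth", "dob": "1980-02-11", "zip": "60601"}

score = match_score(lab, emr)     # tolerant of typos like Jon/John, Smith/Smyth
print(round(score, 2))
if score > 0.85:                  # threshold is illustrative
    print("probable same patient -> link both records under one enterprise ID")
```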

Besides the internal attributes, many organizations also source data from external third parties (e.g. Equifax) that can be used for increasing the accuracy of the matching engine. This is helpful as people change addresses, phone numbers, etc. with time.

Many traditional MDM product vendors, such as IBM and Infor, provide platforms to implement EMPI.
A few organizations have also started using NoSQL and other Big Data platforms for creating a customer hub, as explained here.

Wednesday, February 12, 2014

Ruminating on Star Schema

Sharing a good YouTube video on the basic concepts of a Star Schema. Definitely worth a perusal for anyone wanting a primer on DW schemas.



The data in the dimension tables can also change, albeit less frequently. For example, a state changes its name, or a customer changes their last name. These are known as 'slowly changing dimensions'. To support slowly changing dimensions, we would need to add timestamp (effective-date) columns to the dimension table. A good video explaining these concepts is available below:
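
To make the idea concrete, here is a small, hypothetical Python sketch of a Type 2 slowly changing dimension: a change closes the current row and adds a new versioned row, so history is preserved. The column names and the in-memory "table" are purely illustrative.

```python
# Type 2 slowly changing dimension: version rows instead of overwriting them.
from datetime import date

customer_dim = [
    {"customer_key": 1, "customer_id": "C100", "last_name": "Jones",
     "effective_from": date(2010, 1, 1), "effective_to": None, "is_current": True},
]

def apply_scd2(dim_rows, customer_id, new_last_name, change_date):
    for row in dim_rows:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["effective_to"] = change_date          # close the old version
            row["is_current"] = False
    dim_rows.append({
        "customer_key": max(r["customer_key"] for r in dim_rows) + 1,
        "customer_id": customer_id,
        "last_name": new_last_name,
        "effective_from": change_date,
        "effective_to": None,
        "is_current": True,
    })

apply_scd2(customer_dim, "C100", "Jones-Lee", date(2014, 2, 12))
for row in customer_dim:
    print(row)   # both versions kept; facts join to the version valid at the time
```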


Thursday, September 19, 2013

Transactions in MongoDB

In my previous blog post, we went through some of the advantages of MongoDB in terms of schema flexibility and performance. The MetLife case study showed how we can quickly create a customer hub using MongoDB that supports flexible, dynamic schemas.

Another advantage of using MongoDB is that you don't have to worry about ORM tools, as there is no object-relational impedance mismatch. You also don't have to worry about creating an application cache, as MongoDB by default uses all available memory for its working set.

But what about transactions? Any OLTP application would need full support for ACID to ensure reliability and consistency of data. The following articles shed good light on the transaction support in MongoDB.

http://docs.mongodb.org/manual/faq/fundamentals/
http://css.dzone.com/articles/how-acid-mongodb
http://blog.scrapinghub.com/2013/05/13/mongo-bad-for-scraped-data/

MongoDB only supports atomicity at the document level. It is important to remember that a document can contain nested documents, and a single write remains atomic across those nested documents. But if we need multi-document (multi-object) transaction support, then MongoDB is not a good fit.
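
As a small illustration of what document-level atomicity buys you, here is a hedged pymongo sketch (connection string, database, collection, and field names are all hypothetical). Pushing a policy into the customer's embedded array and incrementing a counter touch one document, so they are applied atomically, together or not at all.

```python
# Single-document atomic update in MongoDB via pymongo (illustrative names).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder URI
customers = client["insurance"]["customers"]

result = customers.update_one(
    {"customer_id": "C100"},
    {
        "$push": {"policies": {"policy_no": "P-778", "type": "auto", "premium": 520}},
        "$inc": {"policy_count": 1},
    },
)
print(result.modified_count)   # both changes land together because it is one document
```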

If your application needs to "join" objects frequently, then MongoDB is also not a good fit in that respect, e.g. joining reference data (static data) from master tables with the transaction data.  
MongoDB locks (readers-writer lock) are at the database level, i.e. the entire database gets locked during a write operation. This can result in lock contention when you have a large number of write operations.

Weighing the pros and cons of MongoDB, IMHO it is best suited for read-heavy applications, e.g. a consolidated high-performance read-only customer hub, a data store for content management systems, product catalogs in e-commerce systems, etc.

Wednesday, September 04, 2013

NoSQL for Customer Hub MDM

The following article on InformationWeek is an interesting read on the use of MongoDB NoSQL for building a customer MDM solution.

http://www.informationweek.com/software/information-management/metlife-uses-nosql-for-customer-service/240154741

MongoDB, being a document-oriented NoSQL database, has its core strength in maintaining flexible schemas and storing data as JSON or BSON objects. Let's look at the pros and cons of using MongoDB as an MDM solution.
  1. One of the fundamental challenges in creating a customer hub is the aggregation of disparate data from a variety of different sources. For example, a customer could have bought a number of products from an insurance firm. Using a traditional RDBMS would entail the complexity of joining table records and satisfying all the referential constraints of the data. Also, each insurance product may have different fields and dimensions. Should we create a table for each product type? In MongoDB, you can store all the policies of the customer in one JSON object. You can store different types of policies for each customer with full flexibility and maintain a natural hierarchy (parent-child) of relationships (see the sketch after this list). 

  2. Another problem that insurance firms face is that of legacy policy records. Certain insurance products such as annuities have a long lifespan, but regulations and business needs change over the years, and your old policy records may not have all the fields that are captured in new policy records. How do you handle such cases? A strict schema would not help here, and hence a solution like MongoDB offers the necessary flexibility to store sparse data. 

  3. MongoDB also has an edge in terms of low TCO for scalability and performance. Its auto-sharding capabilities enable massive horizontal scalability. It also uses memory-mapped files out of the box, which is of tremendous help given the prominence of 64-bit computing and the abundance of available RAM. 
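
As a hedged illustration of point 1 above (all database, collection, and field names are hypothetical), here is a pymongo sketch of storing heterogeneous policies inside a single customer document, so no joins or per-product tables are needed to assemble the 360-degree view.

```python
# Flexible-schema customer hub document in MongoDB (illustrative names).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder URI
customers = client["mdm_hub"]["customers"]

customers.insert_one({
    "customer_id": "C100",
    "name": {"first": "Jane", "last": "Doe"},
    "policies": [
        # each policy type carries its own fields -- no rigid shared schema
        {"type": "annuity", "policy_no": "A-1", "payout_start": "2035-01-01"},
        {"type": "auto", "policy_no": "M-9", "vehicle": {"make": "Honda", "year": 2012}},
    ],
})

# One read returns the whole customer, including every policy, in natural hierarchy.
doc = customers.find_one({"customer_id": "C100"})
print([p["type"] for p in doc["policies"]])
```
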
On the negative side, I am a bit concerned about the integrity of data in the above solution. Since there is no referential integrity, are we 100% confident in the accuracy of the data? We would still need to use data profiling, data cleansing, and data matching tools to identify unique customers and remove duplicates. 
MetLife is using this customer hub only for agents and has not exposed this data to customers, as there are concerns about data integrity and accuracy. But what if we need to enable customers to self-service all their policies from a single window on the organization's portal? We cannot show invalid data to the customer. 
Also, from a skills perspective, MongoDB needs specialized resources. It's easy to use and develop with, but performance tuning and monitoring require niche skills. 

Sunday, July 14, 2013

Street address normalization for MDM solutions

Quite often we need to verify street addresses or normalize them to check for duplicate customers. The following article gives a good overview of the various techniques we can use for this problem context.

http://brizzled.clapper.org/blog/2012/02/14/simple-address-standardization/

There are also a lot of commercial and open-source software options for address verification and geocoding, e.g.:
http://smartystreets.com/
http://www.addressdoctor.com/en/
http://www.melissadata.com/
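
As a toy illustration of the rule-based normalization techniques discussed in the article above (real products do far more, including validation against postal reference data), here is a small Python sketch using only the standard library. The abbreviation map is deliberately tiny and hypothetical.

```python
# Naive rule-based street address normalization for duplicate detection.
import re

ABBREVIATIONS = {            # illustrative mapping, far from complete
    "street": "ST", "st": "ST",
    "avenue": "AVE", "ave": "AVE",
    "road": "RD", "rd": "RD",
    "apartment": "APT", "apt": "APT",
}

def normalize_address(raw: str) -> str:
    cleaned = re.sub(r"[.,#]", " ", raw.upper())   # strip punctuation
    tokens = [ABBREVIATIONS.get(t.lower(), t) for t in cleaned.split()]
    return " ".join(tokens)

a = normalize_address("12 Baker Street, Apt. 4")
b = normalize_address("12 BAKER ST APT 4")
print(a, "|", b, "|", a == b)   # both normalize to the same form -> likely duplicate
```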

Thursday, July 11, 2013

Importance of Geocoding for business

Geocoding is the process of finding out the geographical coordinates (latitude/longitude) from street address, post code, etc. A lot of organizations are interested in geocoding their customer addresses, because it enables them to serve the customer better. For e.g.
  • A healthcare provider can use the geocoding information of its customers, to help them locate the nearest physician or pharmacy. 
  • An insurance firm can use geocoding information to find out the actual physical location of an insured property and determine the underwriting risk for floods, earthquakes, etc.
  • E-commerce sites usually have a find-a-nearby-store option that enables customers to find the nearest store to pick up their goods from, based on their GPS coordinates (a small distance-calculation sketch appears at the end of this post). 
Thus geocoding can help a business in answering many questions that would help it drive growth. For e.g.
  • What geographical area do most of our customers come from?
  • Are there geographical areas where we have not penetrated? If yes, Why?
  • Is our sales force aligned with our customer territories? 
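
As a small illustration of the find-a-nearby-store use case mentioned above, here is a Python sketch that picks the closest store using the haversine (great-circle) distance. The store names and coordinates are made up.

```python
# Nearest-store lookup from geocoded coordinates using the haversine formula.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))   # Earth radius ~6371 km

stores = {                             # hypothetical store locations
    "Downtown": (40.7128, -74.0060),
    "Uptown": (40.8610, -73.8900),
    "Jersey City": (40.7178, -74.0431),
}

customer = (40.7306, -73.9866)         # geocoded from the customer's street address
nearest = min(stores, key=lambda s: haversine_km(*customer, *stores[s]))
print(nearest, round(haversine_km(*customer, *stores[nearest]), 1), "km away")
```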

Householding and Hierarchy Management

Found this rather interesting article on householding concepts in MDM. Many organizations struggle to define business rules to identify customers belonging to the same home or household.

http://www.information-management.com/news/1010001-1.html?zkPrintable=1&nopagination=1

Another interesting read is this blog post on how "http://www.muckety.com" uses hierarchy management to link VIPs/actors together. The graphs on muckety.com are interactive and worth a perusal if you are a Hollywood buff :)


Tuesday, June 11, 2013

Ruminating on Data Masking

A lot of organizations are interested in 'Data Masking' and are actively looking for solutions in this space. IBM's and Informatica's data masking tools are leaders in Gartner's Magic Quadrant.

The need for masking data is very simple: how do we share sensitive enterprise data with development teams, testing teams, training teams, and even offshore teams?
Besides masking data, there are other potential solutions to this problem, i.e. using test data creation tools and UI playback tools. But data masking and subsetting continue to remain popular means of scrambling data for non-production use.

Some of the key requirements for any Data Masking Solution are:
  1. Meaningful Masked Data: The masked data has to be meaningful and realistic, and it should still satisfy all the business rules, e.g. for post codes, credit card numbers, SSNs, bank account numbers, etc. For example, if we change the DOB, should we also change the 'Age'? 
  2. Referential Integrity: If we are scrambling primary keys then we need to ensure that the relationships are maintained. One technique is to make sure that the same scramble functions are applied to all of the related columns. Sometimes, if we are masking data across databases, then we would need to ensure integrity across databases.
  3. Irreversible Masking: The masked data should be irreversible, and it should be impossible to recreate the original sensitive data. 
A good architecture strategy for building a data-masking solution is to design a Policy driven Data Masking Rule Engine. The business users can then define policies for masking different data-sets.
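
As a toy sketch of that policy-driven idea (not any vendor's engine), the following Python example maps columns to masking rules using only the standard library. The rules keep the masked values realistic in format and deterministic (the same input always masks to the same output, which helps preserve referential integrity), while the hash makes them irreversible. All field names and rules are hypothetical.

```python
# Minimal policy-driven masking pass over a record (illustrative rules only).
import hashlib

def mask_ssn(value: str) -> str:
    # Derive replacement digits from a one-way hash: realistic format,
    # deterministic for the same input, and not reversible.
    digest = hashlib.sha256(value.encode()).hexdigest()
    digits = "".join(str(int(c, 16) % 10) for c in digest[:9])
    return f"{digits[:3]}-{digits[3:5]}-{digits[5:9]}"

def mask_email(value: str) -> str:
    user, _, domain = value.partition("@")
    return f"{user[0]}***@{domain}"          # keeps a realistic email shape

POLICY = {                                    # business users map columns to rules
    "ssn": mask_ssn,
    "email": mask_email,
}

def apply_policy(record: dict) -> dict:
    return {k: POLICY.get(k, lambda v: v)(v) for k, v in record.items()}

row = {"name": "Jane Doe", "ssn": "123-45-6789", "email": "jane.doe@example.com"}
print(apply_policy(row))   # name passes through; ssn and email are masked
```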

A lot of data masking tool vendors are now venturing beyond static data masking. Dynamic Data Masking is a new concept that masks data in real time. Also there is a growing demand for masking data in unstructured content such as PDF, Word or Excel files.

Sunday, November 25, 2012

Free Data Modeling tools

There are a plethora of free and opensource UML modeling tools available in the market. But there is very little awareness regarding free database modeling tools. Jotting down a few free tools that are very good for ER modeling and support most popular databases for script generation.

  1. Oracle SQL Developer Data Modeler: This cool modeling tool from Oracle is actually free! It supports physical model creation for SQL Server, IBM DB2 and of course Oracle DBs.
  2. DBDesigner: In the past, I have used this extensively when working with MySQL databases.
  3. MySQL Workbench: Another good tool if you are using MySQL.
  4. Aris Express: Software AG's popular Enterprise Architecture tool ARIS now comes with an express (community) edition that can be used for business process modeling and ER modeling.

Tuesday, October 30, 2012

Ruminating on Transaction Logs

Understanding how transaction logs work in any RDBMS is very important for application design. Found the following good articles that explain the important concepts in simple language.

http://www.simple-talk.com/sql/learn-sql-server/managing-transaction-logs-in-sql-server/
http://www.techrepublic.com/article/understanding-the-importance-of-transaction-logs-in-sql-server/5173108

Any database consists of log files and data files. In the MS world, they are known as *.ldf and *.mdf files respectively. All database transactions (modifications) are first written to the log file. A separate thread (or group of threads) periodically writes from the buffer cache to the data file. Once data is written to a data file, a checkpoint is written to the transaction log. This checkpoint is used as a reference point to "roll forward" transactions, i.e. all transactions after the last checkpoint are applied to the data file when the server is restarted after a failure. This prevents the loss of transactions that were in the buffer but not yet written to the data file.

Transaction logs are required for rollback, log shipping, backup, etc. The transaction log files should be managed by a DBA or else we would run into problems if the log file fills up all the available hard disk space. The DBA should also periodically back-up the log files. The typical back-up commands also truncate the log files. In some databases, the truncation process just marks old records as inactive so they can be overwritten. In such cases, even after truncation, the size of the log file does not reduce and we may have to use different commands to compact or shrink the log files.

A good post on truncation options for SQL Server 2008 is given below:
http://www.codeproject.com/Articles/380879/About-transaction-log-and-its-truncation-in-SQL-Se

Wednesday, October 17, 2012

Performance impact of TDE at the database level

I was always under the opinion that column-level encryption is better and more performant than database- or tablespace-level encryption. But after much research and understanding the internal workings of TDE (Transparent Data Encryption) on SQL Server and Oracle, it does not look like a bad deal!

In fact, if we have a lot of columns that need to be encrypted and also need to fire queries against the encrypted columns, then a full database (tablespace) level encryption using TDE seems to be the best option.

I was a bit skeptical about the performance degradation of full-database TDE, but the concern may be overstated. First and foremost, column-level (cell) encryption can severely affect database query optimization and result in significantly worse performance than encrypting the entire database.
When we use TDE at the database (tablespace) level, then the DB engine can use bulk encryption for entire blocks of data as they are written to or read from the disk.

It is important to note that full-database TDE actually works at the data-file level and not at the individual table/column level. To put it another way, individual columns are not encrypted separately; rather, the entire data files (index files, log files, etc.) are encrypted on disk.

Microsoft states that the performance degradation of using database level TDE is a mere 3-6%.
Oracle states that in 11g, if we use Intel Xeon processors with the AES instruction set, then there is a "near-zero" impact on database performance.

It is important to note the terminology differences regarding TDE used by Microsoft and Oracle. Microsoft refers to full database encryption as TDE (not column-level). Oracle calls it TDE-tablespace and TDE-column level.

Also, TDE is a proven solution from a regulatory perspective, e.g. PCI. Auditors are more comfortable approving a proven industry solution than any custom logic implemented in application code.