Step 1: Let's start with the basics - what is Parquet? (pronounced "par-kay")
Parquet is a file format optimized for storing and handling large datasets. Think of it like a highly organized filing cabinet for data. Instead of storing data in rows (like a traditional spreadsheet), Parquet stores data in columns. This makes it super efficient for queries where you only need specific columns, as it doesn't have to scan the entire dataset.
Key Features of Parquet:
- Columnar Storage: Stores data by columns, not rows, which speeds up queries for specific fields.
- Compression: Shrinks data to save space and make reading faster.
- Compatibility: Works well with big data tools like Hadoop, Spark, and others.
The Parquet file format is great for storing large datasets where you only need to analyze specific columns, like sales data or user activity logs.
Example: Imagine you have a massive table with customer names, ages, and purchases. If you only want to analyze purchases, Parquet lets you grab just that column quickly without touching the rest.
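To make that concrete, here is a minimal sketch in Python using pandas with the pyarrow engine. The file name and column names are made up purely for illustration; the point is that the reader only touches the one column it asks for.

```python
# Minimal sketch: pandas with the pyarrow engine (pip install pandas pyarrow).
# File and column names below are made up for illustration.
import pandas as pd

# Write a tiny table to a Parquet file.
customers = pd.DataFrame({
    "name": ["Ana", "Ben", "Chloe"],
    "age": [34, 28, 45],
    "purchase_amount": [120.50, 75.00, 310.25],
})
customers.to_parquet("customers.parquet")

# Read back only the column we care about -- because Parquet is columnar,
# the reader can skip the name and age columns entirely.
purchases = pd.read_parquet("customers.parquet", columns=["purchase_amount"])
print(purchases)
```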
Step 2: What is a Delta Lake?
Delta Lake is a storage layer that builds on top of Parquet files to add extra features for managing data. Think of it as a smart manager for your Parquet files, ensuring your data stays reliable, consistent, and easy to work with over time.
Delta Lake stores data in Parquet files but adds a transaction log (in JSON format) that tracks changes, versions, and metadata. This provides ACID (atomicity, consistency, isolation, durability) guarantees, allowing safe streaming, updates, deletes, and inserts. Delta Lake also efficiently manages the many Parquet files that make up a table, and handles schema evolution and concurrency control.
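Here is a small, hedged sketch of what that transaction log buys you, using the open-source deltalake (delta-rs) Python package; the table path and column names are assumptions for illustration, not anything from a real setup.

```python
# Minimal sketch: the open-source `deltalake` (delta-rs) package
# (pip install deltalake pandas). The table path and columns are hypothetical.
# Each write produces Parquet data files plus a JSON entry in _delta_log/.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

sales = pd.DataFrame({"user_id": [1, 2], "purchase_amount": [120.50, 75.00]})
write_deltalake("sales_delta", sales)                 # creates version 0
write_deltalake("sales_delta", sales, mode="append")  # commits version 1 atomically

dt = DeltaTable("sales_delta")
print(dt.version())   # current version of the table (1)
print(dt.history())   # per-commit metadata read from the transaction log

# Time travel: load the table as it looked at an earlier version.
v0 = DeltaTable("sales_delta", version=0).to_pandas()
print(len(v0))        # only the rows from the first write
```

A PySpark + Delta Lake setup exposes the same ideas through SQL (MERGE INTO, VERSION AS OF for time travel); the snippet above just keeps the dependencies minimal.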
A great article that illustrates the value that Delta Lake storage layer provides on top of Parquet is here: https://delta.io/blog/delta-lake-vs-parquet-comparison/
The Delta Lake storage layer was created and is spearheaded by Databricks.
Step 3: What is Apache Iceberg?
Apache Iceberg is another open storage layer, similar to Delta Lake, but with a focus on flexibility and performance for massive datasets. It was originally created at Netflix, with early contributions from companies like Apple, was donated to the Apache Software Foundation as an open-source project, and was later embraced by Snowflake.
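For a feel of how an Iceberg table is accessed, here is a small sketch using the pyiceberg library. The catalog name and table identifier are hypothetical, and the snippet assumes a catalog has already been configured for your environment.

```python
# Minimal sketch: the `pyiceberg` library (pip install "pyiceberg[pyarrow]" pandas).
# The catalog name and table identifier are hypothetical; pyiceberg expects a
# catalog to be configured (e.g., in ~/.pyiceberg.yaml) before load_catalog() works.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
events = catalog.load_table("analytics.user_events")

# Column pruning works as with plain Parquet, and Iceberg's table metadata
# (manifests with per-file column statistics) lets the scan skip whole data files.
df = events.scan(selected_fields=("user_id", "purchase_amount")).to_pandas()
print(df.head())
```

Because the table format itself is open, engines such as Spark, Trino, and Flink (and Snowflake, as discussed below) can share the same Iceberg table, which is the multi-engine flexibility the format is known for.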
The battle between Databricks' Delta Lake and Snowflake's adoption of Apache Iceberg is heating up, reflecting broader shifts in data architecture toward open, interoperable lakehouses. What started as specialized storage layers for big data has evolved into a high-stakes rivalry, with both companies vying for dominance in unified analytics platforms.
Databricks, the creator of Delta Lake, has poured resources into open table formats, including a massive acquisition to bolster Iceberg support. Meanwhile, Snowflake, a cloud data warehousing giant, is aggressively embracing Iceberg to enhance its multi-engine capabilities and counter Databricks' lakehouse momentum.
A good article comparing both these storage layers is here: https://dataengineeringcentral.substack.com/p/delta-lake-vs-apache-iceberg-the
In terms of capabilities, both of these storage layers are neck and neck, with each vendor typically matching the other's new features within weeks. The choice ultimately depends on your existing tech stack, scale requirements, and long-term goals. If openness and multi-engine support matter more, go with Iceberg; for Spark-optimized efficiency, choose Delta Lake.
Please note that Databricks recently announced full support for Apache Iceberg :) -- https://www.databricks.com/blog/announcing-full-apache-iceberg-support-databricks