Monday, September 22, 2025

Ruminating on Parquet, Delta Lake and Iceberg

In this blog post, I shall try to demystify some of the concepts around Parquet, Delta Lake and Iceberg in three easy learning steps.

Step 1: Let's start with the basics - what is Parquet? (pronounced "par-kay")
Parquet is a data storage format optimized for handling large datasets. Think of it like a highly organized filing cabinet for data. Instead of storing data in rows (like a traditional spreadsheet), Parquet stores data in columns. This makes it super efficient for queries where you only need specific columns, as it doesn’t have to scan the entire dataset.
Key Features of Parquet:
  • Columnar Storage: Stores data by columns, not rows, which speeds up queries for specific fields.
  • Compression: Shrinks data to save space and make reading faster.
  • Compatibility: Works well with big data tools like Hadoop, Spark, and others.
The Parquet file format is great for storing large datasets where you need to analyze specific columns, like sales data or user activity logs.
Example: Imagine you have a massive table with customer names, ages, and purchases. If you only want to analyze purchases, Parquet lets you grab just that column quickly without touching the rest.
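
To make the example concrete, here is a minimal sketch using the pyarrow library; the file name, column names, and data are illustrative.

# Minimal sketch, assuming the pyarrow package is installed (pip install pyarrow).
import pyarrow as pa
import pyarrow.parquet as pq

# A small table with customer names, ages, and purchases.
table = pa.table({
    "name": ["Asha", "Ben", "Chen"],
    "age": [34, 28, 45],
    "purchases": [120.50, 89.99, 310.00],
})

# Write it out as a compressed Parquet file.
pq.write_table(table, "customers.parquet", compression="snappy")

# Read back only the 'purchases' column; because Parquet is columnar,
# the name and age columns are never scanned.
purchases = pq.read_table("customers.parquet", columns=["purchases"])
print(purchases)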

Step 2: What is Delta Lake?
Delta Lake is a storage layer that builds on top of Parquet files to add extra features for managing data. Think of it as a smart manager for your Parquet files, ensuring your data stays reliable, consistent, and easy to work with over time.
Delta Lake stores data in Parquet files but adds a transaction log (in JSON format) that tracks changes, versions, and metadata. This provides ACID (atomicity, consistency, isolation, durability) compliance, allowing safe streaming, updates, deletes, and inserts. Delta Lake supports efficient management of many Parquet files within a table, handling schema evolution, and concurrency control.
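As a rough sketch of what this looks like in practice, the snippet below uses the open-source deltalake Python package (the delta-rs bindings); the table path and data are illustrative.

# Minimal sketch, assuming the deltalake and pyarrow packages are installed
# (pip install deltalake pyarrow). The path and data are made up.
import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

# The first write creates version 0 of the table (Parquet files + transaction log).
write_deltalake("./sales_delta", pa.table({"id": [1, 2], "amount": [10.0, 20.0]}))

# Appending creates version 1, recorded as a new commit in the log.
write_deltalake("./sales_delta", pa.table({"id": [3], "amount": [30.0]}), mode="append")

dt = DeltaTable("./sales_delta")
print(dt.version())            # latest table version
print(dt.history())            # commits recorded in the JSON transaction log
print(dt.to_pyarrow_table())   # reads the underlying Parquet files

# Time travel: load an earlier version using the transaction log.
print(DeltaTable("./sales_delta", version=0).to_pyarrow_table())
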
A great article that illustrates the value that Delta Lake storage layer provides on top of Parquet is here: https://delta.io/blog/delta-lake-vs-parquet-comparison/
The Delta Lake storage layer was created and is spearheaded by Databricks.

Step 3: What is Apache Iceberg?
Apache Iceberg is another storage layer, similar to Delta Lake, but with a focus on flexibility and performance for massive datasets. It was originally created at Netflix (with Apple as an early contributor), later donated to the Apache Software Foundation, and has since been strongly embraced by Snowflake. It uses the Parquet file format by default, but also supports other file formats like ORC and Avro.
Parquet and ORC are columnar, best for read-heavy analytical workloads due to their high compression and query performance, with Parquet being the default. Avro is a row-based format, ideal for write-heavy streaming or ingestion workloads due to its faster write times.
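To give a feel for the API side, here is a rough sketch using the pyiceberg package; the catalog settings and table name are illustrative assumptions, not a fixed recipe.

# Rough sketch, assuming pyiceberg is installed with extras for a SQL catalog
# and pyarrow (e.g. pip install "pyiceberg[sql-sqlite,pyarrow]"; extras may vary
# by version). The catalog settings and table name are illustrative.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "local",
    **{
        "type": "sql",
        "uri": "sqlite:///iceberg_catalog.db",       # where table metadata is tracked
        "warehouse": "file:///tmp/iceberg_warehouse",  # where the data files live
    },
)

# Load an existing Iceberg table and read just two columns; the data files
# underneath are Parquet by default (ORC and Avro are also supported).
table = catalog.load_table("analytics.user_activity")
print(table.scan(selected_fields=("user_id", "event_type")).to_arrow())
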
The battle between Databricks' Delta Lake and Snowflake's adoption of Apache Iceberg is heating up, reflecting broader shifts in data architecture toward open, interoperable lakehouses. What started as specialized storage layers for big data has evolved into a high-stakes rivalry, with both companies vying for dominance in unified analytics platforms. 
Databricks, the creator of Delta Lake, has poured resources into open table formats, including a massive acquisition to bolster Iceberg support. Meanwhile, Snowflake, a cloud data warehousing giant, is aggressively embracing Iceberg to enhance its multi-engine capabilities and counter Databricks' lakehouse momentum.
A good article comparing both these storage layers is here: https://dataengineeringcentral.substack.com/p/delta-lake-vs-apache-iceberg-the
In terms of capabilities, the two storage layers are neck and neck, with each player reaching feature parity within weeks of the other's releases. The choice ultimately depends on your existing tech stack, scale requirements, and long-term goals. If openness and multi-tool support matter more, go with Iceberg; for Spark-optimized efficiency, choose Delta Lake.
Please note that Databricks recently announced full Apache Iceberg support :) -- https://www.databricks.com/blog/announcing-full-apache-iceberg-support-databricks

Thursday, September 04, 2025

JSON-RPC vs REST, and why is JSON-RPC used in MCP?

I was going down the rabbit hole of MCP protocol details and realised that it was using JSON-RPC instead of REST. 

JSON-RPC is a simple protocol that lets a program on one computer run a function on another computer. It uses JSON to send and receive the requests and responses, making it easy to use and understand.

It is transport-agnostic and can work over HTTP, TCP, sockets, or other message-passing environments. A JSON-RPC request typically includes the method to be called, parameters for that method (optional), and an ID to match the response.

Given below is a simple example of a request and response:
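
The sketch below shows such an exchange as Python dictionaries for readability; the method name and parameters are made up.

# Illustrative JSON-RPC 2.0 request and response; the method and params
# are made up for the example.
import json

request = {
    "jsonrpc": "2.0",
    "method": "getBalance",            # the remote function to invoke
    "params": {"accountId": "12345"},  # optional arguments for that method
    "id": 1,                           # used to match the response to this request
}

response = {
    "jsonrpc": "2.0",
    "result": {"balance": 250.75},     # replaced by an "error" object on failure
    "id": 1,                           # same id as the request it answers
}

print(json.dumps(request, indent=2))
print(json.dumps(response, indent=2))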


Given below are the top 3 differences between JSON-RPC and REST for API design (a short sketch contrasting the two follows the list):

Architecture Style:

  • JSON-RPC: RPC-oriented, focusing on invoking specific methods or procedures on the server (e.g., calling a function like getBalance()). It treats interactions as direct commands.
  • REST: Resource-oriented, centered on manipulating resources (e.g., /users/{id}) using standard HTTP methods like GET, POST, PUT, and DELETE.

Endpoints and HTTP Methods:

  • JSON-RPC: Uses a single endpoint (e.g., /rpc) with POST requests for all method calls, simplifying routing but limiting HTTP verb usage.
  • REST: Employs multiple endpoints (e.g., /users, /orders) and leverages various HTTP methods (GET, POST, PUT, DELETE) to represent different actions on resources.

Request/Response Structure:

  • JSON-RPC: Requests and responses follow a strict JSON format with fields like "method", "params", "id", and "result" or "error". Supports batching natively.
  • REST: Uses flexible request formats (URL paths, query parameters, headers) and responses rely on HTTP status codes (e.g., 200, 404) with custom payloads; batching requires custom implementation.
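
To make the contrast concrete, here is a purely illustrative sketch of the same "fetch user 42" operation expressed both ways; the endpoints and method names are hypothetical.

# Hypothetical endpoints and method names; the requests package is assumed
# (pip install requests).
import requests

# JSON-RPC: one endpoint, always POST, the method name travels in the body.
rpc = requests.post(
    "https://api.example.com/rpc",
    json={"jsonrpc": "2.0", "method": "getUser", "params": {"id": 42}, "id": 1},
)
print(rpc.json()["result"])

# REST: the resource lives in the URL and the HTTP verb carries the action.
rest = requests.get("https://api.example.com/users/42")
print(rest.json())
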
MCP chose JSON-RPC because its lightweight, single-endpoint design ensures fast and efficient communication for real-time AI tasks. It supports batch requests, allowing multiple operations in one call, which suits MCP’s complex AI workflows.
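
For illustration, a JSON-RPC 2.0 batch is just a JSON array of request objects sent in a single call; the method names and endpoint below are hypothetical, not MCP's actual method set.

# Hypothetical batch call: several JSON-RPC requests in one round trip.
import requests

batch = [
    {"jsonrpc": "2.0", "method": "listTools", "id": 1},
    {"jsonrpc": "2.0", "method": "summarize", "params": {"docId": "abc"}, "id": 2},
]
responses = requests.post("https://api.example.com/rpc", json=batch).json()

# The server returns an array of response objects, matched to requests by id.
for reply in responses:
    print(reply["id"], reply.get("result", reply.get("error")))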