Saturday, September 08, 2018

Ruminating on Kafka consumer parallelism

Many developers struggle to understand the nuances of parallelism in Kafka, so here are a few points from the Kafka documentation site that should help.
  • Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.
  • Producers publish events to the partitions of a topic. The producer is responsible for choosing which partition to assign each record to within the topic. This can be done in a round-robin fashion simply to balance load, or according to some semantic partition function (say, based on a key in the record); see the producer sketch after this list.
  • The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second, they act as the unit of parallelism.
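To make the partitioning point concrete, here is a minimal sketch using the standard Java kafka-clients producer (the topic name "orders", the key "customer-42", and the broker address are made-up for illustration). Records that carry the same key hash to the same partition, while records with a null key are spread across partitions to balance load:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PartitionedProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key -> same partition (the default partitioner hashes the key),
            // so per-key ordering is preserved.
            producer.send(new ProducerRecord<>("orders", "customer-42", "order created"));

            // Null key -> the producer spreads records across partitions to balance load.
            producer.send(new ProducerRecord<>("orders", null, "audit ping"));
        }
    }
}

Keying by something like a customer id is the usual way to get parallelism without giving up ordering, since Kafka guarantees order only within a partition.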
Unlike other messaging middleware, parallel consumption of messages (aka load-balanced consumers) in Kafka is ONLY POSSIBLE using partitions. 

Kafka keeps one offset per [consumer-group, topic, partition]. Hence there cannot be more active consumer instances within a single consumer group than there are partitions; any extra instances simply sit idle.
So if you have only one partition, you can have only one consumer (within a particular consumer-group). You can of course have consumers across different consumer-groups, but then each group receives its own copy of the messages (broadcast semantics), rather than the messages being load-balanced.
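As a sketch of load-balanced consumption (again plain Java kafka-clients; the group id "order-processors" and the topic "orders" are assumed names): run two copies of this process with the same group.id against a two-partition topic, and each instance will be assigned one partition.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupConsumerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "order-processors");        // all instances share this group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                // Each instance in the group is assigned a disjoint subset of the
                // partitions; offsets are tracked per (group, topic, partition).
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}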

Batch ETL to Stream Processing

Many of our customers are moving their traditional ETL jobs to real-time stream processing.
The following article is an excellent read on why Kafka is a strong choice for unifying batch processing and stream processing.

https://www.infoq.com/articles/batch-etl-streams-kafka

Snippets from the article:

  • Several recent data trends are driving a dramatic change in the old-world batch Extract-Transform-Load (ETL) architecture: data platforms operate at company-wide scale; there are many more types of data sources; and stream data is increasingly ubiquitous.
  • Enterprise Application Integration (EAI) was an early take on real-time ETL, but the technologies used were often not scalable. This led to a difficult choice with data integration in the old world: real-time but not scalable, or scalable but batch.
  • Apache Kafka is an open source streaming platform that was developed seven years ago within LinkedIn.
  • Kafka enables the building of streaming data pipelines from “source” to “sink” through the Kafka Connect API and the Kafka Streams API.
  • Logs unify batch and stream processing. A log can be consumed via batched “windows”, or in real time by examining each element as it arrives; both modes are illustrated in the sketch below.
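To illustrate that last point, here is a minimal Kafka Streams sketch (the application id "etl-demo" and the topic name "events" are assumptions, and the windowing API shown is from the Kafka 2.x era) that consumes the same log both ways: per record as it arrives, and aggregated over one-minute windows.

import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

public class LogUnificationDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "etl-demo"); // assumed name
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> events = builder.stream("events");

        // Real-time path: examine each element as it arrives.
        events.foreach((key, value) -> System.out.println("event: " + value));

        // Batch-like path: count events per key over one-minute windows.
        events.groupByKey()
              .windowedBy(TimeWindows.of(Duration.ofMinutes(1)))
              .count()
              .toStream()
              .foreach((windowedKey, count) ->
                      System.out.println(windowedKey + " -> " + count));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}

The same topology reads one log once; the "batch" view is just a windowed aggregation layered over the real-time stream, which is the unification the article describes.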