Saturday, September 08, 2018

Ruminating on Kafka consumer parallelism

Many developers struggle to understand the nuances of parallelism in Kafka. So jotting down a few points that should help from the Kafka documentation site.
  • Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.
  • Publishers can publish events into different partitions of Kafka. The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load or it can be done according to some semantic partition function (say based on some key in the record).
  • The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second they act as the unit of parallelism. 
Unlike other messaging middleware, parallel consumption of messages (aka load-balanced consumers) in Kafka is ONLY POSSIBLE using partitions. 

Kafka keeps one offset per [consumer-group, topic, partition]. Hence there cannot be more consumer instances within a single consumer group than there are partitions
So if you have only one partition, you can have only one consumer (within a particular consumer-group). You can of-course have consumers across different consumer-groups, but then the messages would be duplicated and not load-balanced.