
Essentially the problem is that Kafka doesn't have a way of managing acknowledgements on a per-message basis. This means that consumers of Kafka topics are assigned exclusive access on a per-partition basis, and the consumer group manager only tracks the acknowledged offset of each partition.

As such you end up with two main problems. The first is head-of-line blocking: if a consumer reads a message that it's unable to process, or that will take an inordinate amount of time to process, it can't move forward past it. Committing a later offset would mark the problematic message as done, so if it doesn't want to risk skipping a message that wasn't processed correctly, it potentially has to replay every message since the problematic one. Secondly, you can end up with hot partitions if the "cost" of messages isn't uniformly distributed across partitions, because load isn't balanced across consumers, i.e. there is no work stealing or other mechanism for other consumers to help out with processing a hot partition.
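To make the head-of-line blocking concrete, here is a toy model in plain Python (no Kafka client involved). The names `consume_partition`, `handler` and the "poison" message are made up for illustration; the only thing it tries to capture is that a single committed offset per partition forces the consumer to stop at the first failure.

```python
from typing import Callable, List


def consume_partition(
    messages: List[str],
    committed: int,
    handler: Callable[[str], bool],
) -> int:
    """Process messages in order from `committed`; return the new committed offset.

    Only one offset is tracked for the partition, so the loop stops at the
    first failure: committing past it would mark the failed message as done.
    """
    offset = committed
    while offset < len(messages):
        if not handler(messages[offset]):
            break  # stuck here; everything behind this message waits
        offset += 1
    return offset


# Hypothetical partition contents; "poison" stands in for an unprocessable message.
partition = ["a", "b", "poison", "c", "d"]
new_offset = consume_partition(partition, 0, lambda m: m != "poison")
blocked = partition[new_offset + 1:]

print(new_offset)  # 2
print(blocked)     # ['c', 'd']
```

"c" and "d" are perfectly processable, but they sit behind offset 2 and never get looked at until something deals with the poison message.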

Log systems with queue/subscription overlays like Pulsar and GCP Pub/Sub solve this by doing per-message acknowledgement (sometimes referred to as selective acknowledgement, as opposed to the cumulative acknowledgement that Kafka does), usually by layering a persistent subscription abstraction over the top of the underlying log.
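A rough sketch of what selective acknowledgement looks like as a data structure, again in plain Python. This is not how Pulsar or Pub/Sub actually implement it; `SelectiveAckCursor`, `low_watermark` and `acked_ahead` are hypothetical names. The idea is just a contiguous "everything below this is done" mark plus a set of individually acked offsets above it.

```python
from typing import List, Set


class SelectiveAckCursor:
    """Toy subscription cursor over a log with per-message acknowledgement."""

    def __init__(self) -> None:
        self.low_watermark = 0            # every offset below this is acked
        self.acked_ahead: Set[int] = set()  # acked offsets at or above the mark

    def ack(self, offset: int) -> None:
        if offset < self.low_watermark:
            return  # already covered by the contiguous mark
        self.acked_ahead.add(offset)
        # Slide the mark forward over any run of acked offsets.
        while self.low_watermark in self.acked_ahead:
            self.acked_ahead.remove(self.low_watermark)
            self.low_watermark += 1

    def pending(self, log_end: int) -> List[int]:
        """Offsets below `log_end` that still need (re)delivery."""
        return [
            o for o in range(self.low_watermark, log_end)
            if o not in self.acked_ahead
        ]


cursor = SelectiveAckCursor()
for offset in (0, 1, 3, 4):  # offset 2 is the problematic message
    cursor.ack(offset)

pending_before = cursor.pending(log_end=5)
cursor.ack(2)
pending_after = cursor.pending(log_end=5)

print(pending_before)  # [2]
print(pending_after)   # []
```

With a cumulative scheme the state would be stuck at offset 2 with 3 and 4 unprocessed. Here only the one bad message stays outstanding, and it can be redelivered to any consumer.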

This is in contrast to pure queue systems like RabbitMQ, SQS, etc. that use a heap or mailbox approach where messages are simply emptied out as they are processed and don't share the log-style structure of systems like Kafka.

So TL;DR: if you use Kafka like a job queue you will end up in situations where queue processing gets stuck behind a single unprocessable message or a patch of them.

The mitigations for it aren't pretty. They either involve building your own selective acknowledgement layer or a series of retry queues that messages are pushed onto using Kafka transactions, with a final dead-letter queue at the end. Instead, either wait for https://cwiki.apache.org/confluence/display/KAFKA/KIP-932%3A... if you really want Kafka, or use something that already does this, i.e. Pulsar.
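For a feel of the retry-queue mitigation, here is an in-memory sketch where deques stand in for topics. The tier names and the `drain` function are invented for the example, and there are no real transactions or retry delays here. In actual Kafka the "forward to next tier" step and the offset commit on the source topic would need to happen in one transaction.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

# Hypothetical topic names; the last tier is the dead-letter queue.
TIERS: List[str] = ["main", "retry-1", "retry-2", "dead-letter"]


def drain(
    topics: Dict[str, Deque[str]],
    handler: Callable[[str], bool],
) -> List[str]:
    """Process each tier in turn, pushing failures onto the next tier.

    A failed message never blocks the tier it came from: it is moved on,
    so the source offset can advance past it.
    """
    processed: List[str] = []
    for tier, next_tier in zip(TIERS, TIERS[1:]):
        queue = topics[tier]
        while queue:
            msg = queue.popleft()
            if handler(msg):
                processed.append(msg)
            else:
                # In Kafka: produce to next_tier and commit the source
                # offset atomically (assumed, not shown here).
                topics[next_tier].append(msg)
    return processed


topics: Dict[str, Deque[str]] = {name: deque() for name in TIERS}
topics["main"].extend(["a", "poison", "b"])

done = drain(topics, lambda m: m != "poison")
dead = list(topics["dead-letter"])

print(done)  # ['a', 'b']
print(dead)  # ['poison']
```

It works, but you now own several extra topics, their consumers, and the ordering guarantees you gave up by moving messages out of line, which is why "not pretty" is a fair description.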