by Xorlev
3518 days ago
One of the biggest problems with treating Kafka as a job queue is head-of-line blocking. Kafka doesn't expose per-message visibility/acknowledgement semantics the way RabbitMQ, Redis PUSH+POP, or SQS do. Each consumer group tracks offsets into the partitions of a log (aka a topic). An offset is just a number that points to a specific message in a Kafka partition. If you get stuck on message 123, you either can't proceed to 124, proceed without committing your offset and risk replaying 124, or skip 123. A great many of our services publish to Kafka. Consuming services that want to treat individual records as tasks (or bundles of tasks), as opposed to a linear log, must either skip failures or push them onto SQS for background retry. Our batching consumers have to track out-of-order completion of work and commit only up to the end of the contiguous run of completed offsets, meaning a slow task can delay offset commits. If a consumer is stopped before finishing that slow task, we have to replay work, which means all work has to be idempotent. In practice it works well enough, but it's still some gymnastics. I suspect this is why Google invested so much into making PubSub scalable despite per-message semantics. It's considerably simpler in many ways, even if you have to bake in your own ordering/monotonically increasing identifiers.
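To make the commit bookkeeping concrete, here is a minimal sketch in Python. The `OffsetTracker` class and its method names are hypothetical and written for illustration only; this is not the code from the services described above, and it does not talk to Kafka at all. It assumes the usual Kafka convention that the committed offset is the next offset to be consumed.

```python
class OffsetTracker:
    """Tracks out-of-order task completion for a single partition.

    Hypothetical helper for illustration; not part of any Kafka client.
    """

    def __init__(self, start_offset):
        # Offset that is safe to commit: every offset below it is finished.
        self.committable = start_offset
        # Offsets finished out of order, waiting for earlier ones.
        self.done = set()

    def complete(self, offset):
        """Mark an offset as finished and return the new commit position."""
        if offset >= self.committable:
            self.done.add(offset)
        # Advance past the contiguous run of finished offsets.
        while self.committable in self.done:
            self.done.remove(self.committable)
            self.committable += 1
        return self.committable


tracker = OffsetTracker(start_offset=123)
tracker.complete(124)            # 123 is still running, so nothing moves
tracker.complete(125)
stuck = tracker.committable      # still 123: the slow task blocks the commit
tracker.complete(123)            # the slow task finally finishes
caught_up = tracker.committable  # 126: everything through 125 is done
```

If the consumer dies while `committable` is still 123, a restart resumes from 123 and reruns 124 and 125 even though they already finished, which is why the work has to be idempotent.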
I've also thought about reserving a topic + consumer group specifically for failed jobs and baking the retry logic into KQ itself. But that's an area I still need to explore.
I'm not sure if I understand what you are saying about batching consumers. What do you mean by batching in this context? Thanks for your input.