by joohwan 3517 days ago
Very true. I indeed found the lack of visibility into per-message information very painful when I was building this. One way I tried to alleviate the issue was providing a consumer "callback" to make it easier for users to plug their own code in to handle job failures (like your example of using SQS).

I've also thought about reserving a topic + consumer group specifically for failed jobs and baking the retry logic into KQ itself. But that's an area I must explore more.

I'm not sure if I understand what you are saying about batching consumers. What do you mean by batching in this context? Thanks for your input.

1 comments

We have some consumers which treat log entries as tasks, and often it's handy to debounce some of the work into larger chunks that can be executed in parallel. The chunks can be linear or they could be grouped by some property of the message (e.g. account id). In that case, we have batches of messages with multiple non-consecutive offsets, e.g. [123, 145, 155], [122, 124, 144]. In practice, that means inserting each message offset into a per-partition sorted set of pending work. When a batch completes, all the offsets in that batch are marked as "complete" and we commit the lowest safe offset. Using the example above, if the batch [122, 124, 144] completed, we'd still have [123, 145, 155] outstanding, which means the lowest safe offset is 122* even though 124 and 144 also completed in that same batch. Until the other batch, [123, 145, 155], completes, 123 is still outstanding, making it the barrier to committing a higher offset.
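A rough Python sketch of that bookkeeping, in case it makes the idea more concrete. This is not our actual consumer code: the class and method names are made up for illustration, it assumes every message read from the partition is registered with the tracker before its batch runs, and it returns the offset in the Kafka sense of "next offset to fetch on restart".

```python
import bisect


class PartitionOffsetTracker:
    """Hypothetical helper: tracks in-flight offsets for a single partition."""

    def __init__(self):
        self._pending = []          # sorted offsets that are not yet complete
        self._highest_seen = None   # largest offset ever registered

    def add(self, offset):
        """Register an offset as pending work."""
        bisect.insort(self._pending, offset)
        if self._highest_seen is None or offset > self._highest_seen:
            self._highest_seen = offset

    def complete_batch(self, offsets):
        """Mark every offset in a finished batch as complete."""
        for offset in offsets:
            i = bisect.bisect_left(self._pending, offset)
            if i < len(self._pending) and self._pending[i] == offset:
                self._pending.pop(i)

    def committable_offset(self):
        """Return the offset that is safe to commit, or None if nothing was seen.

        While work is outstanding, this is the lowest pending offset, so that
        message is fetched again after a restart. Once everything is done,
        it is one past the highest offset seen.
        """
        if self._highest_seen is None:
            return None
        if self._pending:
            return self._pending[0]
        return self._highest_seen + 1


tracker = PartitionOffsetTracker()
for offset in [123, 145, 155, 122, 124, 144]:
    tracker.add(offset)

tracker.complete_batch([122, 124, 144])
print(tracker.committable_offset())  # 123: still outstanding, so it is the barrier

tracker.complete_batch([123, 145, 155])
print(tracker.committable_offset())  # 156: nothing pending, resume after 155
```

A sorted list with `bisect` stands in for the sorted set here; any ordered structure with a cheap "minimum" lookup would do.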

Our batching consumers provide pluggable behavior for handling a failing batch, but usually it's pushed onto SQS since those can cycle around a few times until we notice and fix whatever condition is preventing progress on that work.

* 123 actually, since the committed offset is where the consumer resumes: if you commit offset 123, the consumer will fetch offset 123 again on start. But that's implementation esoterica.