by kgeist
1329 days ago
> Attempting to retry these failed jobs tied up our worker and it was unable to process new incoming events, resulting in a severe backlog in our queues.

Interesting - we have this kind of thing quite often. Basically, an event gets stuck in the queue due to a logic error or a prior race condition, and it's retried endlessly, blocking the rest of the events from being processed. We can't just automatically remove such an event from the queue, because events must be processed in order or client data can get corrupted. It requires manual intervention (we have alerts in place), and every time it's a different event, so we have to be creative and think quickly: how do we unblock the queue without corrupting client data by skipping events?

After an event is unstuck, there's a huge backlog of unprocessed events, which in the worst cases can take up to a few hours to drain. Fortunately we have some sharding in place, so several independent workers can process the same global queue. With workers' shard affinity we can process shard data in order AND in parallel, so SRE can temporarily increase the number of workers when the queue gets too large, to speed it up.

I still don't know how to solve this kind of problem once and for all (i.e. with zero manual intervention). Is it even solvable?
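To make the shard-affinity idea concrete, here is a minimal sketch, not the actual setup. It assumes each event carries a `client_id` field and that ordering only matters per client; `shard_for` and `partition` are hypothetical names made up for illustration.

```python
import zlib
from collections import defaultdict


def shard_for(client_id: str, num_shards: int) -> int:
    """Map a client to a shard with a stable hash.

    zlib.crc32 is used instead of the built-in hash(), because hash() on
    strings is randomized per process and would break affinity.
    """
    return zlib.crc32(client_id.encode("utf-8")) % num_shards


def partition(events, num_shards):
    """Split one global, ordered event stream into per-shard ordered lists.

    Assumption: each event is a dict with a "client_id" key. Events of the
    same client always land in the same shard, in their original order, so
    one worker per shard preserves per-client ordering.
    """
    shards = defaultdict(list)
    for event in events:
        shards[shard_for(event["client_id"], num_shards)].append(event)
    return shards
```

With this scheme, different shards can be drained in parallel, while a stuck event only blocks the clients that hash to its shard. Note that changing `num_shards` remaps clients to different shards.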
I don't know much about your application, but the fact that you can mitigate the problem by scaling the number of workers suggests that the ordering requirements might actually be fairly weak. In the worst case you may be able to push every event that is interdependent with the failing one to a DLQ (dead-letter queue) using a temporary blacklisting mechanism, but by that stage I think I would just prefer better testing.
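A rough sketch of what that temporary blacklisting could look like, under the assumption that interdependence can be expressed as a shared `key` on each event (for example a client ID). `QuarantiningProcessor` and all its members are hypothetical names for illustration, not any real queue library's API.

```python
from collections import deque


class QuarantiningProcessor:
    """Processes events in order; parks a failing key's events in a DLQ."""

    def __init__(self, handler, max_attempts=3):
        self.handler = handler            # callable that raises on failure
        self.max_attempts = max_attempts
        self.blocked_keys = set()         # the temporary blacklist
        self.dead_letters = deque()       # parked events, in arrival order

    def process(self, events):
        for event in events:
            key = event["key"]
            if key in self.blocked_keys:
                # A prior event for this key failed: park this one too, so
                # nothing for the key is ever applied out of order.
                self.dead_letters.append(event)
            elif not self._try_handle(event):
                self.blocked_keys.add(key)
                self.dead_letters.append(event)

    def _try_handle(self, event):
        # Bounded retries instead of retrying forever and blocking the queue.
        for _ in range(self.max_attempts):
            try:
                self.handler(event)
                return True
            except Exception:
                continue
        return False

    def replay(self, key):
        """Once the bug is fixed, re-run the parked events for one key."""
        parked = [e for e in self.dead_letters if e["key"] == key]
        self.dead_letters = deque(
            e for e in self.dead_letters if e["key"] != key
        )
        self.blocked_keys.discard(key)
        # If the first event still fails, process() blocks the key again
        # and re-parks the remainder in the same order.
        self.process(parked)
```

Unrelated keys keep flowing while the failing key waits for a fix, and `replay` applies its events in the original order. This only helps if the dependency really is per key; events that depend on each other across keys would still need manual handling.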