Hacker News
by timeagain 975 days ago
In a similar vein, something about queuing that has annoyed me as a developer at multiple large FANG corporations is poor thinking about queue metrics. The TLDR is that metrics provided by the queue itself are rarely helpful for knowing whether your service is healthy, and when it is not healthy they are not very useful for determining why.

Most queue processing services that I have seen have an alarm on (a) oldest message age, and (b) number of messages in the queue.

In every team I joined I have quickly added a custom metric (c): the time of successful processing minus the time that a message was /initially/ added to the queue, i.e. end-to-end latency. This metric tends to uncover lots of nasty edge cases regarding retries, priority starvation, and P99 behavior that are hidden by (a) and (b).
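A minimal sketch of metric (c), assuming the message carries its original enqueue timestamp through every retry. The names here (`Message`, `first_enqueued_at`, `requeue_for_retry`) are made up for illustration and do not come from any particular queue library.

```python
from dataclasses import dataclass


@dataclass
class Message:
    body: str
    # Stamped once, on the very first enqueue, and never reset afterwards.
    first_enqueued_at: float
    attempts: int = 0


def enqueue(body: str, now: float) -> Message:
    """Create a message and stamp its initial enqueue time."""
    return Message(body=body, first_enqueued_at=now)


def requeue_for_retry(msg: Message) -> Message:
    """Put a failed message back on the queue.

    Deliberately leaves first_enqueued_at alone, so time spent in
    retries still counts toward the end-to-end latency.
    """
    msg.attempts += 1
    return msg


def end_to_end_latency(msg: Message, processed_at: float) -> float:
    """Metric (c): successful-processing time minus initial enqueue time."""
    return processed_at - msg.first_enqueued_at


# A message enqueued at t=100s that fails once and succeeds at t=130s
# reports 30s, even if its final attempt only waited a few seconds.
msg = requeue_for_retry(enqueue("resize-image", now=100.0))
latency = end_to_end_latency(msg, processed_at=130.0)
```

If the queue resets its own "sent" timestamp on redelivery, the original timestamp has to travel in the message payload or attributes instead, which is the whole reason (a) tends to hide retry problems.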

Having 100000 messages in the queue is only an issue if they are not being processed at (at least) 100000/s. Having a 6-hour-old message in the queue is concerning, but maybe it is an extreme outlier, so alarming is unnecessary. But you can bet your bottom dollar that if your average processing latency spikes by 10x, you want to know about it.
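One way to alarm on "average latency spiked by 10x" is to compare a short recent window against a longer baseline. This is a rough sketch, not a production alarm: the class name, window sizes, and the 10x factor are assumptions picked for illustration.

```python
from collections import deque
from statistics import mean


class LatencySpikeAlarm:
    """Fires when the recent average latency is >= factor x the baseline average."""

    def __init__(self, baseline_window: int = 1000,
                 recent_window: int = 50, factor: float = 10.0):
        self.baseline = deque(maxlen=baseline_window)
        self.recent = deque(maxlen=recent_window)
        self.factor = factor

    def observe(self, latency_s: float) -> None:
        # Once the recent window is full, its oldest sample graduates into
        # the baseline just before the new sample pushes it out.
        if len(self.recent) == self.recent.maxlen:
            self.baseline.append(self.recent[0])
        self.recent.append(latency_s)

    def firing(self) -> bool:
        # Stay quiet until there is both a baseline and a full recent window.
        if not self.baseline or len(self.recent) < self.recent.maxlen:
            return False
        return mean(self.recent) >= self.factor * mean(self.baseline)
```

Feed it the end-to-end latency of every successfully processed message. A single 6-hour outlier barely moves the recent average, while a sustained slowdown trips the alarm.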

The other nice thing about an end-to-end latency metric is early warning: (a) and (b) both tend to look great all the way up to the point of failure/back pressure, and then they blow up excitingly. (c), on the other hand, will pick up on things like a slight increase in application latency, allowing you to diagnose beforehand whether your previously over-provisioned queue is becoming at-capacity or under-provisioned.

2 comments

Makes sense!

I was just talking with a Temporal solutions engineer this week, and this is the metric they recommend autoscaling on. Instead of autoscaling on queue depth, you scale on queue latency! Specifically, they split it into the time from enqueue to start and the time from start to done, and you scale on the former, not the total ("ScheduleToStart" in their terms).
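To make the split concrete, here is a small sketch of the two latencies and a naive scaling rule driven only by the first one. This is not Temporal's API. `JobTimestamps`, `desired_workers`, the one-step-at-a-time policy, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class JobTimestamps:
    enqueued_at: float
    started_at: float
    finished_at: float

    @property
    def schedule_to_start(self) -> float:
        """Time spent waiting in the queue before a worker picked the job up."""
        return self.started_at - self.enqueued_at

    @property
    def start_to_done(self) -> float:
        """Time spent actually executing the job."""
        return self.finished_at - self.started_at


def desired_workers(current_workers: int, recent_jobs: Sequence[JobTimestamps],
                    target_wait_s: float, max_workers: int) -> int:
    """Scale on queue wait only; slow jobs alone do not trigger scaling."""
    if not recent_jobs:
        return current_workers
    avg_wait = sum(j.schedule_to_start for j in recent_jobs) / len(recent_jobs)
    if avg_wait > target_wait_s:
        return min(max_workers, current_workers + 1)
    if avg_wait < target_wait_s / 2 and current_workers > 1:
        return current_workers - 1
    return current_workers
```

The point of scaling on the wait rather than the total is that a long-running but promptly started job says nothing about whether more workers would help.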

Time from enqueue to start isn't a good metric - it completely disregards the queue size. Enqueuing 1M jobs won't change this metric, as it only updates once a job reaches the front of the queue, and by the time the 1Mth job gets there, the situation is already over.

I had much better results with a metric that shows estimated queue time for jobs that are getting enqueued right now (queue_size * running_avg_job_processing_time / parallelism).
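The formula above, written out as a sketch. The exponential moving average is just one assumed way to keep a running average of job processing time, and the smoothing factor `alpha` is arbitrary.

```python
from typing import Optional


class RunningAverage:
    """Exponentially weighted running average of job processing time."""

    def __init__(self, alpha: float = 0.1):
        self.alpha = alpha
        self.value: Optional[float] = None

    def update(self, sample: float) -> float:
        if self.value is None:
            self.value = sample  # the first sample seeds the average
        else:
            self.value = self.alpha * sample + (1 - self.alpha) * self.value
        return self.value


def estimated_queue_time(queue_size: int,
                         running_avg_job_processing_time: float,
                         parallelism: int) -> float:
    """Estimated wait, in seconds, for a job enqueued right now."""
    if parallelism <= 0:
        raise ValueError("parallelism must be positive")
    return queue_size * running_avg_job_processing_time / parallelism


# 1M queued jobs at 0.5s each across 100 workers: about 5000s of wait
# for a job enqueued right now.
wait_s = estimated_queue_time(1_000_000, 0.5, 100)
```

Unlike enqueue-to-start, this moves the moment the 1M jobs land in the queue, not when the last of them finally gets picked up.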

>>> I was just talking with a Temporal solutions engineer

Aha! Just as the second season of Loki dropped. Makes sense now

Less sarcastically - this ties in with the article, I guess. The runat time is the enqueue, and then you are arguing for two latencies: time from enqueue to start, and time from start to complete.

Exactly! Other queue metrics have too many false positives and false negatives.
Slow-running processes are completely normal and required in the real world. They are not extreme outliers.

Enterprise processes can wind between many intermediaries. Hours, days, weeks, maybe even months.