Hacker News new | ask | show | jobs
by jpollock 138 days ago
Measurement and alerting is usually done in business metrics, not the causes. That way you catch classes of problems.

Not sure about expected loss, that's a decay rate?

But stuck jobs are via tasks being processed and average latency.