by danfarrelly
1088 days ago
> we had rolled our own monitoring which was itself a PITA to maintain

Thanks! What type of monitoring were you looking for? We have some basic metrics now, but know we need to improve this. What metrics, alerting, observability are important for you?
1. Wait timings for jobs.
2. Run timings for jobs.
3. Timeout occurrences, plus the stdout/stderr logs of those runs.
4. Retry metrics and, if there is a retry limit, metrics on jobs that were abandoned after hitting it.
One thing that is easy to overlook is giving users the ability to define a specific “urgency” for their jobs, which would allow different alerting thresholds on things like run time or wait time.
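
To make the urgency idea concrete, here is a rough sketch of what I mean. Everything in it is made up for illustration: the `Urgency` levels, the `JobRun` fields, the threshold values, and the `check_alerts` helper are assumptions, not anything from an existing product or library. Wait time is taken as start minus enqueue, run time as finish minus start, and each urgency level carries its own limits.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class Urgency(Enum):
    # Hypothetical urgency levels a user could attach to a job.
    LOW = "low"
    NORMAL = "normal"
    HIGH = "high"


@dataclass(frozen=True)
class Thresholds:
    max_wait: timedelta  # longest acceptable time spent queued
    max_run: timedelta   # longest acceptable time spent executing


# Illustrative limits only; real values would come from user configuration.
THRESHOLDS = {
    Urgency.LOW: Thresholds(timedelta(hours=1), timedelta(hours=4)),
    Urgency.NORMAL: Thresholds(timedelta(minutes=10), timedelta(hours=1)),
    Urgency.HIGH: Thresholds(timedelta(seconds=30), timedelta(minutes=5)),
}


@dataclass
class JobRun:
    job_id: str
    urgency: Urgency
    enqueued_at: datetime
    started_at: datetime
    finished_at: datetime

    @property
    def wait_time(self) -> timedelta:
        # Time between enqueue and the worker picking the job up.
        return self.started_at - self.enqueued_at

    @property
    def run_time(self) -> timedelta:
        # Time between the job starting and finishing.
        return self.finished_at - self.started_at


def check_alerts(run: JobRun) -> list[str]:
    """Return alert messages for a run that exceeded its urgency's limits."""
    limits = THRESHOLDS[run.urgency]
    alerts = []
    if run.wait_time > limits.max_wait:
        alerts.append(
            f"{run.job_id}: waited {run.wait_time}, limit {limits.max_wait}"
        )
    if run.run_time > limits.max_run:
        alerts.append(
            f"{run.job_id}: ran {run.run_time}, limit {limits.max_run}"
        )
    return alerts
```

With these example limits, a two-minute wait would trigger an alert for a `HIGH` job but not for a `LOW` one, which is the whole point: the same timing can be fine or alarming depending on what the user said about the job.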