by peteradio 1416 days ago
What do you do for repeated failures? Does it get flagged for a manual debug or does it kick into a different mode of automation?
1 comment

We notice repeated failures because we have metrics on our "up-to-dateness", and those metrics stall when jobs keep failing. We also send logs to CloudWatch Logs and alarm when errors exceed a certain threshold. Once an alarm fires, we investigate manually to find out why the job is failing. This happens occasionally, but not often. While we investigate, repeat jobs keep spinning up at some frequency, but this hasn't proved to be a problem.
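
To make the two signals concrete, here is a rough sketch of the decision logic: one check for a stalled freshness metric and one for an error count over a threshold. This is not our actual setup, and it does not call CloudWatch. The names `AlarmConfig`, `is_stalled`, `too_many_errors`, and `should_alarm`, along with every threshold value, are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence


@dataclass(frozen=True)
class AlarmConfig:
    # Hypothetical thresholds; the comment above gives no concrete values.
    max_staleness: timedelta = timedelta(hours=2)
    max_errors: int = 5
    error_window: timedelta = timedelta(minutes=15)


def is_stalled(last_success: datetime, now: datetime, cfg: AlarmConfig) -> bool:
    """True when the last successful run is older than max_staleness."""
    return now - last_success > cfg.max_staleness


def too_many_errors(
    error_times: Sequence[datetime], now: datetime, cfg: AlarmConfig
) -> bool:
    """True when at least max_errors errors fall inside the recent window."""
    window_start = now - cfg.error_window
    recent = [t for t in error_times if window_start <= t <= now]
    return len(recent) >= cfg.max_errors


def should_alarm(
    last_success: datetime,
    error_times: Sequence[datetime],
    now: datetime,
    cfg: AlarmConfig = AlarmConfig(),
) -> bool:
    """Fire if either signal trips; a human investigates from there."""
    return is_stalled(last_success, now, cfg) or too_many_errors(
        error_times, now, cfg
    )
```

Either condition only pages someone. It does not stop the scheduler, so repeat jobs keep being launched while the failure is investigated, which matches what is described above.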