We notice repeated failures because we have metrics on our "up to dateness", and those metrics will stall. We also send logs to CloudWatch logs and alarm on certain threshold of errors. Once an alarm fires, we investigate manually and see why the job is failing. This happens occasionally but not too much. While we are investigating, we are spinning up repeat jobs with some frequency, but this hasn't proved to be a problem.