Hacker News
by barrkel 1083 days ago
Re-running failed jobs should be automated wherever possible. Expected, routine failures should not require manual intervention. If you don't have this attitude, toil will gradually increase over time until all anyone ever does is put out small fires.
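
A minimal sketch of what automated re-running could look like, assuming each job is a plain Python callable and that its failures are transient. The name `run_with_retries`, the attempt limit, and the backoff delays are hypothetical choices for illustration, not anything from a specific job system:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    job: Callable[[], T],
    max_attempts: int = 3,      # assumed limit, tune per job
    base_delay: float = 1.0,    # assumed first delay in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run job(), retrying with exponential backoff on any exception."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception:
            # Out of attempts: re-raise so a human (or an alert) sees it.
            if attempt == max_attempts:
                raise
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Only the final failure propagates, so manual intervention is reserved for jobs that keep failing after the routine retries are exhausted.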

A failed thumbnail job might not be worth an early-morning alarm, but running out of disk space might be, and so might other work being blocked because thumbnail generation failed.

1 comment

Nothing should be done if it is not economical. In my experience, issues with message buses happen very often in the first weeks after rollout and then disappear for a while or forever.

This means: merge the PR first, let it go live, have your working students or interns rerun the failed jobs, and wait a month. If it is still happening by then, you have proof of a problem that needs to be fixed.

Disk space: use your monitoring tool to warn you proactively when free disk space drops below 20% or is shrinking too quickly.
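
As a rough sketch of those two checks, assuming you sample the free fraction periodically yourself rather than using a real monitoring tool. The function names and the 2%-per-hour rate threshold are made up for illustration; only the 20% floor comes from the comment above:

```python
import shutil
from typing import List


def free_fraction(path: str = "/") -> float:
    """Fraction of the filesystem at `path` that is currently free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def disk_warnings(
    free_now: float,
    free_before: float,
    hours_elapsed: float,
    min_free: float = 0.20,           # the 20% floor
    max_drop_per_hour: float = 0.02,  # assumed rate threshold
) -> List[str]:
    """Return warnings for low free space and for fast shrinkage."""
    if hours_elapsed <= 0:
        raise ValueError("hours_elapsed must be positive")
    warnings = []
    if free_now < min_free:
        warnings.append(f"free space {free_now:.0%} is below {min_free:.0%}")
    drop_per_hour = (free_before - free_now) / hours_elapsed
    if drop_per_hour > max_drop_per_hour:
        warnings.append(f"free space is shrinking by {drop_per_hour:.1%} per hour")
    return warnings
```

A scheduler would call `free_fraction` at a fixed interval, keep the previous sample, and route any returned warnings to whatever alerting channel is already in place.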

If some other work is blocked by failed thumbnails, that is a logic bug, not a consequence of the message bus. That work would have been blocked before the message bus was introduced anyway.