Hacker News
by augustflanagan 3500 days ago
I built https://cronitor.io after having an important cron job fail silently for several days. When I mentioned this problem to a friend his first response was "we just had a major issue with cron failing silently at my work too".

We decided to hack on it together, and we've since grown Cronitor from a tool built for our own needs into a small business with a couple hundred paying customers.

3 comments

As someone about to write a cron job, what ways do you see Cron fail?

My assumption is that cron is robust and reliable, it's the job script itself that may fail silently and need monitoring, yes?

Yeah, I'd say your assumption is correct. However, other common ways that I've seen cron "fail" are the cron entry being installed in the wrong user's crontab, the permissions on a script changing such that it can't be executed, temporarily disabling or removing a cron job and forgetting to re-enable it, etc.
My favorite is forgetting a newline at the end of the crontab. I always put a warning at the bottom of my crontabs.

# Always end with a newline!
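
A check for the missing-newline case could look like the minimal sketch below. The function name `ends_with_newline` is made up for illustration, and it only inspects the file's last byte.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the file's last byte is a newline.
ends_with_newline() {
    file=$1
    # An empty file has no final line to lose, so treat it as fine.
    [ -s "$file" ] || return 0
    # Command substitution strips trailing newlines, so a final newline
    # byte leaves an empty string here; any other last byte does not.
    [ -z "$(tail -c 1 "$file")" ]
}
```

It could be run against a crontab file before installing it, for example `ends_with_newline my.crontab && crontab my.crontab`, where `my.crontab` is a placeholder name.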

The most common failure I've seen is when people forget to use full paths in cron.

bash instead of /bin/bash (or similar)
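
One way to catch this before the job ever runs is to look the command up under a stripped-down environment. This is only a sketch: the helper name and the `/usr/bin:/bin` value are assumptions about a typical cron default, so check what your cron actually sets.

```shell
#!/bin/sh
# Assumed cron-like PATH; override CRON_LIKE_PATH to match your system.
CRON_LIKE_PATH=${CRON_LIKE_PATH:-/usr/bin:/bin}

# Hypothetical helper: succeed if the command name resolves under that PATH.
resolves_under_cron_path() {
    # env -i starts from an empty environment, so only the PATH passed
    # here is visible to the child shell doing the lookup.
    env -i PATH="$CRON_LIKE_PATH" /bin/sh -c 'command -v "$1" >/dev/null 2>&1' sh "$1"
}
```

If `resolves_under_cron_path some-tool` fails, the crontab entry needs either the full path to the tool or its own PATH setting.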

There's other weirdness in cron-land too. Cron will completely ignore files whose names contain a ".".
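
A quick pre-flight check for the file-name problem might look like this. It assumes the rule is "letters, digits, underscores and hyphens only", which is how some distributions treat drop-in directories such as /etc/cron.d; the helper name is made up, and your cron's man page is the authority on the actual rule.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the file name uses only letters, digits,
# underscores and hyphens (an assumed naming rule, see the note above).
cron_safe_name() {
    name=$(basename "$1")
    case $name in
        # Reject empty names and names with any character outside the set.
        ''|*[!A-Za-z0-9_-]*) return 1 ;;
        *) return 0 ;;
    esac
}
```

Under that assumption, `cron_safe_name /etc/cron.d/backup.sh` fails because of the dot, while `cron_safe_name /etc/cron.d/backup` passes.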
The cron process can be killed by the oom_killer, resulting in missed crons. This may be aggravated if the cron service is never respawned later.
Where I used to work, I found that cron jobs usually failed because they weren't adequately tested. I tried using MAILTO, but it didn't do what I wanted, so I started putting in something like this (it's been a while):

45 5 * * * /bin/bash -eux -C /path/to/real/cmd >> /var/log/somefile 2>&1 | mail -s "/path/to/real/cmd... at 10.2.3.4 failed with details in /var/log/somefile" to-address < tail /var/log/somefile

...and then test failures to make sure the email and everything else worked as expected (mail might not be set up correctly on the box by default, for example). And either overwrite /var/log/somefile with ">" instead of ">>" or use logrotate. Of course, the /path/to/real/cmd script, if it is a shell script, should have something like "set -eux" or at least "set -e" at the top (and be well tested); otherwise it won't always report failures and this has no chance of working.
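
To see why "set -e" matters for reporting, here is a small sketch with two throwaway job bodies that differ only in that one setting. The variable and function names are made up for illustration.

```shell
#!/bin/sh
# Two hypothetical job bodies: the first step fails in both.
job_without_e='false; echo "finished"'
job_with_e='set -e; false; echo "finished"'

# Run a snippet in a child shell and print the exit status it returns.
exit_status_of() {
    rc=0
    sh -c "$1" >/dev/null 2>&1 || rc=$?
    echo "$rc"
}
```

Without "set -e", the child shell carries on past the failed step and exits with the status of the final echo, so anything keyed on the exit status sees success. With "set -e", it stops at the failed step and exits non-zero.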

I didn't (in mild use) see unreported failures after that, and it was really handy for problem diagnoses when something did go wrong thereafter.

But after any change I had to carefully test every failure mode again, because it seemed so easy to miss something that causes unexpected behavior. I may even have had to wrap it in a single-line "if" statement ("...else mail...").

It would be fun but time-consuming to automate those tests, maybe with shunit2 (or something named roughly like that), to rerun periodically and make sure ops didn't change the mail config to break this setup, or something.

I know that looks awful but I enjoyed it. It might just be easier to use your replacement. How did you advertise?

In case anyone ever reads that, it's too late for me to edit, but: "file 2>&1 | mail -s ..." should be "file 2>&1 || mail -s ...", or more likely the wrapped "if" mentioned.
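
For anyone who wants the wrapped "if" form spelled out, here is a minimal sketch, not the exact setup described above. The function name `run_and_alert` and the `MAILER` override are assumptions added so the alert path can be tested without a working mail setup. The log tail is piped into mail instead of being redirected with "< tail".

```shell
#!/bin/sh
# Assumed override hook: point MAILER at a stub when testing the alert path.
MAILER=${MAILER:-mail}

# Hypothetical wrapper: run_and_alert LOGFILE TO-ADDRESS COMMAND [ARGS...]
# Appends the command's output to LOGFILE and mails the log tail only
# when the command exits non-zero.
run_and_alert() {
    log=$1 to=$2
    shift 2
    if "$@" >> "$log" 2>&1; then
        return 0
    else
        rc=$?
        tail "$log" | "$MAILER" -s "$* failed with status $rc, details in $log" "$to"
        return "$rc"
    fi
}
```

Saved as a script that ends with `run_and_alert "$@"`, it could replace the one-liner in the crontab entry. The job itself still needs "set -e" (or equivalent error handling) so that it actually exits non-zero on failure.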
Also https://wdt.io/cron-monitoring.html; I think it's been around for a while.