Ask HN: How do you catch cron jobs that "succeed" but produce wrong results?


1 points by BlackPearl02 143 days ago

I've been dealing with a frustrating problem: my cron jobs return exit code 0, but the results are wrong.

Examples:

- Backup script completes successfully but creates empty backup files
- Data processing job finishes but only processes 10% of records
- Report generator runs without errors but outputs incomplete data
- Database sync completes but the counts don't match

The logs show "success" (exit code 0, no exceptions), but the actual results are wrong. The errors might be buried in logs, but I'm not checking logs proactively every day.

I've tried:

- Adding validation checks in scripts (e.g., if count < 100: exit 1): works, but you have to modify every script, and changing thresholds requires code changes
- Webhook alerts: requires writing connectors for every script
- Error monitoring tools (Sentry, etc.): they catch exceptions, not wrong results
- Manual spot checks: not scalable

The validation-in-script approach works for simple cases, but it's not flexible. What if you need to change the threshold? What if the file exists but is from yesterday? What if you need to check multiple conditions? You end up mixing monitoring logic with business logic.
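One way to loosen that coupling, sketched here under assumptions rather than taken from any tool in this thread: keep the thresholds in a small file next to the job and have a generic function read them. The function name `validate_result`, the variable names `MIN_LINES` and `MAX_AGE_MINUTES`, and the file layout are all made up for illustration. It covers the three questions above: a changeable threshold, a stale file, and more than one condition.

```shell
#!/bin/sh
# Sketch only: thresholds live in a separate file, so they can change
# without editing the job script. All names here are hypothetical.

# validate_result FILE THRESHOLDS_FILE
# THRESHOLDS_FILE is a plain shell fragment, for example:
#   MIN_LINES=100
#   MAX_AGE_MINUTES=1440
validate_result() {
    file=$1
    thresholds=$2
    MIN_LINES=1              # defaults, used when the thresholds file omits a value
    MAX_AGE_MINUTES=1440
    errors=0

    # Pass a path containing a slash so "." does not search $PATH.
    if [ -r "$thresholds" ]; then . "$thresholds"; fi

    if [ ! -f "$file" ]; then
        echo "FAIL: $file is missing"
        return 1
    fi

    # Condition 1: enough records (tr strips padding some wc versions add).
    lines=$(wc -l < "$file" | tr -d ' ')
    if [ "$lines" -lt "$MIN_LINES" ]; then
        echo "FAIL: $file has $lines lines, expected at least $MIN_LINES"
        errors=$((errors + 1))
    fi

    # Condition 2: fresh enough. find prints the path only if the file was
    # modified more than MAX_AGE_MINUTES minutes ago (-mmin: GNU and BSD find).
    if [ -n "$(find "$file" -mmin "+$MAX_AGE_MINUTES")" ]; then
        echo "FAIL: $file is older than $MAX_AGE_MINUTES minutes"
        errors=$((errors + 1))
    fi

    [ "$errors" -eq 0 ]
}
```

A job would then end with something like `validate_result /tmp/file /etc/myjob/thresholds.conf || exit 1` (both paths are placeholders). Since the thresholds file is sourced as shell, it should be writable only by people you would also trust to edit the script.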

I built a simple monitoring tool that watches job results instead of just execution status. You send it the actual results (file size, record count, status, etc.) and it alerts if something's off. No need to dig through logs, and you can adjust thresholds without deploying code.

How do you handle this? Are you adding validation to every script, proactively checking logs, or using something that alerts when results don't match expectations? What's your approach to catching these "silent failures"?

3 comments

razingeden 143 days ago

it’d depend on what exactly is failing there.

File missing:

    if [ ! -f /tmp/file ]; then exit 1; fi

File doesn’t contain 100 lines:

    COUNT=$(wc -l < /tmp/file)
    if [ "$COUNT" -lt 100 ]; then exit 1; fi

File doesn’t contain a known header or record :

    HEADER=$(grep -Ec SOME_CSV_VALUE /tmp/file)
    if [ "$HEADER" -eq 0 ]; then exit 1; fi

any of those could be things like MySQL cli query or a wget call to a webserver.

Generally, I have one long script that validates a combination of these. As it runs, it echoes the HTML head with a meta refresh tag, then my table and my table row. Each “if” case then appends a <TD></TD></TR>, with the if/else branches writing a red or a green cell into an HTML file as it goes down the list.

That way if I have say, 50-100 critical things that run every morning I have a visual dashboard when one screws up.
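A minimal sketch of what such a dashboard script could look like. This is not the script described above, just one way to get the same red/green table. The function names (`add_row`, `has_min_lines`, `build_dashboard`), the 60-second refresh, and the three example checks are assumptions for illustration.

```shell
#!/bin/sh
# Sketch of a red/green HTML status page; every name below is hypothetical.

# add_row LABEL COMMAND [ARGS...]
# Runs COMMAND and prints one table row: green if it exits 0, red otherwise.
add_row() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        color=green; text=OK
    else
        color=red; text=FAIL
    fi
    printf '<tr><td>%s</td><td style="background:%s">%s</td></tr>\n' \
        "$label" "$color" "$text"
}

# has_min_lines FILE N: true if FILE exists and has at least N lines.
has_min_lines() {
    [ -f "$1" ] && [ "$(wc -l < "$1" | tr -d ' ')" -ge "$2" ]
}

# build_dashboard OUTPUT_HTML DATA_FILE
build_dashboard() {
    out=$1
    data=$2
    {
        echo '<html><head><meta http-equiv="refresh" content="60"></head><body>'
        echo '<table border="1">'
        add_row "file exists"       test -f "$data"
        add_row "at least 100 rows" has_min_lines "$data" 100
        add_row "header present"    grep -q SOME_CSV_VALUE "$data"
        echo '</table></body></html>'
    } > "$out.tmp"
    # Write to a temp file first so a browser never loads a half-written page.
    mv "$out.tmp" "$out"
}
```

Calling `build_dashboard /var/www/html/status.html /tmp/file` from cron (the web path is a placeholder) regenerates the page on each run. Any command that sets an exit status can be a row, including a database query or an HTTP fetch.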

As far as I know this is still in use 14 years after deployed. It’s in all the stuff that starts an options exchange every day.

And then I left behind another one at a telco that checks all their RADIUS servers and RADIUS partners and pops a cell red when one isn’t responding to auth requests. I “think” they are still using some form of it, except that SolarWinds now hooks into those exit codes and they don’t really care about the HTML page.


Bender 143 days ago

> Backup script completes successfully but creates empty backup files

The cron job itself would need to do sanity checks on the results, e.g. a comparison of before/after directory sizes, file counts, and perhaps a few canary files that never change. After doing the math on those checks, it should alter its exit status and trigger monitoring alerts via your preferred mechanism. Your script can control the exit status. Some use functions that perform sanity checks, cleanup traps, etc., and with each failure add a number to the value the script finally exits with (assuming bash; '$?' itself cannot be assigned, so keep the running total in a variable), adding text output at the end of the script to describe the failures when the script is called in verbose mode.
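As a rough illustration of that pattern, here is a sketch in POSIX sh. The helper names (`check`, `min_file_count`, `no_empty_files`, `canary_unchanged`, `run_backup_checks`) and the `VERBOSE` variable are invented for this example. It keeps a running failure count in a variable and returns it as the status, capped at 255 because an exit status is a single byte.

```shell
#!/bin/sh
# Sketch: run several sanity checks, count failures, and turn the count
# into the exit status. Function names and paths are hypothetical.

failures=0
report=""

# check DESCRIPTION COMMAND [ARGS...]: record a failure if COMMAND fails.
check() {
    desc=$1
    shift
    if ! "$@" >/dev/null 2>&1; then
        failures=$((failures + 1))
        report="${report}FAILED: ${desc}
"
    fi
}

# min_file_count DIR N: true if DIR holds at least N regular files.
min_file_count() {
    [ "$(find "$1" -type f | wc -l | tr -d ' ')" -ge "$2" ]
}

# no_empty_files DIR: true if no regular file under DIR is zero bytes.
no_empty_files() {
    [ -z "$(find "$1" -type f -size 0)" ]
}

# canary_unchanged FILE EXPECTED: compare with a checksum recorded earlier.
canary_unchanged() {
    [ "$(cksum < "$1")" = "$2" ]
}

# run_backup_checks DIR MIN_FILES CANARY CANARY_SUM
# Returns 0 if every check passed, otherwise the failure count (max 255).
run_backup_checks() {
    failures=0
    report=""
    check "backup has at least $2 files" min_file_count "$1" "$2"
    check "no empty backup files"        no_empty_files "$1"
    check "canary file unchanged"        canary_unchanged "$3" "$4"
    # Describe the failures only when asked to be verbose.
    if [ -n "${VERBOSE:-}" ]; then printf '%s' "$report"; fi
    if [ "$failures" -gt 255 ]; then failures=255; fi
    return "$failures"
}
```

A backup job could finish with `run_backup_checks "$BACKUP_DIR" 10 "$CANARY" "$CANARY_SUM" || exit $?`, where those variables are placeholders the job would set itself. The canary checksum would be recorded once with `cksum < "$CANARY"`.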

In other words, whatever you, the human, did to realize there is a problem, have the script perform the same checks as if it were you, and alter the exit status and/or perform whatever other alerting methods are available to you.

If you change the exit status, be sure the script is idempotent, as some cron daemons may try to re-run the script depending on the specific exit status. In other words, decide what you really want the script to do if it is run a second consecutive time. Read up on the cron daemon you are using, how it interprets exit status, and what it will do.
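One simple way to make a re-run harmless, sketched here as an assumption rather than a recommendation for any particular cron daemon: write a dated marker file only after the job succeeds, and skip the work if today's marker already exists. The function name `run_once_per_day` and the marker naming scheme are hypothetical.

```shell
#!/bin/sh
# Sketch: a dated marker file makes a second run on the same day a no-op.
# Names and paths are hypothetical.

# run_once_per_day STAMP_DIR JOB_NAME COMMAND [ARGS...]
run_once_per_day() {
    stamp="$1/$2.$(date +%Y-%m-%d).done"
    shift 2
    if [ -e "$stamp" ]; then
        echo "already completed today, skipping"
        return 0
    fi
    # Propagate the job's own exit status on failure, and leave no marker
    # behind, so a retry after a failed run still does the work.
    "$@" || return $?
    : > "$stamp"
}
```

With this wrapper, `run_once_per_day /var/tmp nightly-backup /usr/local/bin/backup.sh` (placeholder paths) does the work at most once per successful day, while a failed run stays retryable.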


t-3 143 days ago

Validation checks are really the only solution if you can't fix the real problem - your processes are returning 0 when they are failing. Can you file a bug report?
