I've been dealing with a frustrating problem: my cron jobs return exit code 0, but the results are wrong. Examples:
Backup script completes successfully but creates empty backup files
Data processing job finishes but only processes 10% of records
Report generator runs without errors but outputs incomplete data
Database sync completes but the counts don't match
The logs show "success" — exit code 0, no exceptions — but the actual results are wrong. The errors might be buried in logs, but I'm not checking logs proactively every day. I've tried:
Adding validation checks in scripts (e.g., if count < 100: exit 1) - works, but you have to modify every script, and changing thresholds requires code changes
Webhook alerts - requires writing connectors for every script
Error monitoring tools (Sentry, etc.) - they catch exceptions, not wrong results
Manual spot checks - not scalable
The validation-in-script approach works for simple cases, but it's not flexible. What if you need to change the threshold? What if the file exists but is from yesterday? What if you need to check multiple conditions? You end up mixing monitoring logic with business logic.
I built a simple monitoring tool that watches job results instead of just execution status. You send it the actual results (file size, record count, status, etc.) and it alerts if something's off. No need to dig through logs, and you can adjust thresholds without deploying code.
How do you handle this? Are you adding validation to every script, proactively checking logs, or using something that alerts when results don't match expectations? What's your approach to catching these "silent failures"?
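For the "file is from yesterday" and "multiple conditions" cases, here is a rough sketch of what a reusable check might look like, with the thresholds passed in as arguments instead of being hard-coded in each job. The name check_output and its parameters are made up for illustration, and find -mmin is a GNU/BSD extension rather than POSIX, so treat this as a starting point only.

```shell
#!/bin/sh
# Hypothetical helper: validate a job's output file against thresholds
# supplied by the caller, so limits can change without editing the job.
# Usage: check_output FILE MIN_LINES MAX_AGE_MINUTES
check_output() {
    file=$1; min_lines=$2; max_age_min=$3

    # Condition 1: the file must exist and be non-empty.
    [ -s "$file" ] || { echo "FAIL: $file missing or empty"; return 1; }

    # Condition 2: the file must be fresh. find prints the path only if
    # it was modified less than max_age_min minutes ago.
    [ -n "$(find "$file" -mmin "-$max_age_min")" ] ||
        { echo "FAIL: $file is older than $max_age_min minutes"; return 1; }

    # Condition 3: the file must contain at least min_lines lines.
    # tr strips the padding some wc implementations add.
    lines=$(wc -l < "$file" | tr -d ' ')
    [ "$lines" -ge "$min_lines" ] ||
        { echo "FAIL: $file has $lines lines, expected at least $min_lines"; return 1; }

    echo "OK: $file"
}
```

A cron wrapper could then run something like check_output /tmp/file 100 1440 after the job and exit non-zero when it fails, which keeps the thresholds in the crontab line or a config file rather than in the business logic.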
File missing:
if [ ! -f /tmp/file ]; then exit 1; fi
File contains fewer than 100 lines:
COUNT=$(wc -l < /tmp/file)
if [ "$COUNT" -lt 100 ]; then exit 1; fi
File doesn’t contain a known header or record:
HEADER=$(grep -Ec SOME_CSV_VALUE /tmp/file)
if [ "$HEADER" -eq 0 ]; then exit 1; fi
Any of those checks could just as well be something like a MySQL CLI query or a wget call to a web server.
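As a rough idea of what the MySQL and wget variants might look like, here is a sketch. The function names are invented for illustration, and it assumes the mysql client can already log in (for example through a ~/.my.cnf file); the URL and query you pass in are up to you.

```shell
#!/bin/sh
# Hypothetical check: succeed only if the web server answers for the URL.
# --spider makes wget request the page without saving it.
# Usage: url_ok URL
url_ok() {
    wget -q --spider --timeout=10 "$1"
}

# Hypothetical check: succeed only if a query returns at least MIN.
# -N drops the column header, -B gives plain batch output,
# -e runs the statement and exits.
# Usage: row_count_ok "SQL returning one number" MIN
row_count_ok() {
    count=$(mysql -N -B -e "$1") || return 1
    [ "$count" -ge "$2" ]
}
```

Either one slots into the same pattern as the file checks, for example: if ! url_ok "$URL"; then exit 1; fi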
Generally, I have one long script that validates a combination of these. As it runs, it echoes the HTML: the meta refresh tag, the table, and the table rows. Each “if” case appends a <TD> </TD></TR>, and its “else” branch picks the other color, so a red or a green cell is added to an HTML file for each check as the script goes down the list.
That way, if I have, say, 50-100 critical things that run every morning, I have a visual dashboard that shows when one screws up.
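A stripped-down sketch of that kind of dashboard script might look like the following. This is a reconstruction of the idea rather than the original script: the DASHBOARD path, the function names, and the 60-second refresh are all assumptions.

```shell
#!/bin/sh
# Hypothetical sketch of a red/green HTML status page built from checks.
# DASHBOARD is an assumed output path; override it in the environment.
DASHBOARD=${DASHBOARD:-/tmp/dashboard.html}

# Start the page: a meta refresh tag so the browser reloads it
# periodically, then the opening table tag.
start_page() {
    {
        echo '<html><head><meta http-equiv="refresh" content="60"></head><body>'
        echo '<table border="1">'
    } > "$DASHBOARD"
}

# Run a check command and append one row: green if it exits 0, red if not.
# Usage: add_row LABEL COMMAND [ARGS...]
add_row() {
    label=$1; shift
    if "$@" >/dev/null 2>&1; then
        color=green
    else
        color=red
    fi
    echo "<tr><td>$label</td><td bgcolor=\"$color\">&nbsp;</td></tr>" >> "$DASHBOARD"
}

# Close the table and the page.
end_page() {
    echo '</table></body></html>' >> "$DASHBOARD"
}
```

The morning run would call start_page, then one add_row per check (for example add_row "backup file exists" test -f /tmp/file), then end_page. Because add_row only looks at the exit code, any check that exits non-zero on bad results can feed it.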
As far as I know, this is still in use 14 years after it was deployed. It’s in all the stuff that starts an options exchange every day.
I also left behind another one at a telco that checks all their RADIUS servers and RADIUS partners and turns a cell red when one isn’t responding to auth requests, and I “think” they are still using some form of it. The difference is that SolarWinds now hooks into those exit codes, so they don’t really care about the HTML page.