by ydant
1433 days ago
healthchecks.io is a great service (and apparently can be self-hosted: https://github.com/healthchecks/healthchecks) that I use for both personal projects and at work. It works really well for cron jobs. A single call per run is enough, but you can also call a /start endpoint when the job begins and another when it finishes to get extra insights such as the runtime of your jobs.

It would be nice if it had slightly more complex alerting rules available, for example a "this service should run successfully at least once every X hours, but is fine to fail multiple times otherwise" type of alert. We wanted to use it for monitoring some periodic downloads (like downloading partners' reports), where we expect the call to often time out, fail, or have no data to download. Each of those is technically a "failure", but it only matters if it goes on for more than a day. Since healthchecks.io doesn't really support this, we ended up writing our own "stale data" monitoring logic and alerting inside the downloader, and just use healthchecks.io to monitor that the script isn't crashing.
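For readers wondering what such "stale data" logic might look like, here is a minimal sketch in Python. It is not the commenter's actual code: the function name `is_stale`, the 24-hour default, and the choice to treat "never succeeded" as stale are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_success, now, max_age=timedelta(hours=24)):
    """Return True when the last successful download is older than max_age.

    last_success is None when no download has ever succeeded; this sketch
    assumes that case should also count as stale.
    """
    if last_success is None:
        return True
    return now - last_success > max_age


# Example: individual failures are ignored, only the age of the last
# success decides whether to alert.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
recent = now - timedelta(hours=3)
old = now - timedelta(hours=30)
print(is_stale(recent, now), is_stale(old, now))
```

The downloader would record a timestamp on each successful run and call something like this check on every run, alerting only when it returns True.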
|
This should work if you set the period to "X hours", and send success signals only, no failure signals. In that case, as long as the gap between the success signals is below X hours, all is well. When there's been no success signal for more than X hours, Healthchecks sends out alerts.
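A rough sketch of the "success signals only" approach, assuming a Python downloader. `PING_URL` is a placeholder for your own check's ping URL, and the `report` helper with its injectable `ping` argument is a made-up name used here so the logic can be exercised without network access.

```python
import urllib.request

# Placeholder: substitute the ping URL of your own check.
PING_URL = "https://hc-ping.com/your-uuid-here"


def _http_ping(url):
    # A plain GET on the ping URL counts as a success signal.
    urllib.request.urlopen(url, timeout=10).close()


def report(succeeded, ping=None):
    """Send a success signal when the run succeeded; stay silent otherwise.

    Returns True if a signal was sent.
    """
    if not succeeded:
        # Deliberately no failure signal: a failed run just lets the
        # "X hours" period keep counting down.
        return False
    if ping is None:
        ping = _http_ping
    ping(PING_URL)
    return True
```

With the check's period set to X hours, a run that times out or finds no data simply sends nothing, and an alert only goes out once no success has arrived within the period.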
I'm guessing you probably also want to log failures using the /fail endpoint. The problem is that when Healthchecks receives a failure event, it sends out alerts immediately.
One potential feature I'm considering is a new "/log" endpoint. When a client pings this endpoint, Healthchecks would treat it as neither a success nor a failure, and just log the received data. You could then use this endpoint in place of /fail. Just logging the failure would not cause any immediate alerts. But the information would be there for inspection when X hours pass with no success signals and you eventually do get alerted. How does that sound?
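To make the proposal concrete, here is a sketch of how a client might use it, in Python. The "/log" endpoint is only a proposal in this comment, so treat its URL shape and behavior as assumptions; `BASE_URL`, `build_ping`, and `send_ping` are hypothetical names, and sending the failure message as the request body is likewise an assumption about how the logged data would be supplied.

```python
import urllib.request

# Placeholder: substitute the ping URL of your own check.
BASE_URL = "https://hc-ping.com/your-uuid-here"


def build_ping(succeeded, message=""):
    """Return (url, body) for one run outcome."""
    # Assumption: the proposed "/log" endpoint is used where /fail would
    # otherwise be, so a failed run is recorded without alerting.
    url = BASE_URL if succeeded else BASE_URL + "/log"
    return url, message.encode("utf-8")


def send_ping(succeeded, message=""):
    """POST the outcome; the body carries details for later inspection."""
    url, body = build_ping(succeeded, message)
    request = urllib.request.Request(url, data=body, method="POST")
    urllib.request.urlopen(request, timeout=10).close()
```

A failed download would call something like `send_ping(False, "partner report: timed out")`, leaving a trail to read through once the X-hour alert does fire.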