
by ydant 1481 days ago

healthchecks.io is a great service (and apparently can be self-hosted - https://github.com/healthchecks/healthchecks) that I use for both personal projects and at work.

It works really well for cron jobs. A single call per run is enough, but you can also call a /start endpoint when the job begins and the regular endpoint when it finishes, and get extra insights such as the runtime of your jobs.
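As a rough illustration of the start/finish pattern described above, here is a minimal Python sketch. It assumes the hosted service's ping URL shape (`https://hc-ping.com/<uuid>`, with `/start` and `/fail` suffixes). The UUID is a placeholder, and the helper names `ping_url`, `send_ping` and `run_with_pings` are made up for this example.

```python
import urllib.request
from typing import Callable

# Placeholder: substitute the ping URL of your own check.
PING_BASE = "https://hc-ping.com/your-uuid-here"


def ping_url(base: str, suffix: str = "") -> str:
    """Build a ping URL; suffix is '' (success), 'start' or 'fail'."""
    base = base.rstrip("/")
    return f"{base}/{suffix}" if suffix else base


def send_ping(url: str, timeout: float = 10.0) -> None:
    """Best-effort GET: a monitoring outage should not break the job itself."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
    except OSError:
        pass  # network errors are deliberately ignored in this sketch


def run_with_pings(
    job: Callable[[], None],
    base: str = PING_BASE,
    send: Callable[[str], None] = send_ping,
) -> bool:
    """Ping /start, run the job, then ping success or /fail."""
    send(ping_url(base, "start"))
    try:
        job()
    except Exception:
        send(ping_url(base, "fail"))
        return False
    send(ping_url(base))
    return True
```

The `send` parameter is injectable so the wrapper can be exercised without touching the network. With both a start and a finish ping, the service can work out the runtime from the gap between them.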

It would be nice if it had slightly more complex alerting rules available - for example, a "this service should run successfully at least once every X hours, but is fine to fail multiple times otherwise" type alert.

We wanted to use it for monitoring some periodic downloads (like downloading partners' reports), and the expectation is the call will often time out or fail or have no data to download, which is technically a "failure", but only if it goes on for more than a day. Since healthchecks.io doesn't really support this, we ended up writing our own "stale data" monitoring logic and alerting inside the downloader, and just use healthchecks.io to monitor the script not crashing.
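The actual "stale data" logic is not shown in this thread, so the following is only a guess at its simplest form: remember when data was last fetched successfully and alert once that is older than a threshold. The function name `is_stale`, the one-day threshold and the choice to treat "never succeeded" as stale are all assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(
    last_success: Optional[datetime],
    now: datetime,
    max_age: timedelta = timedelta(days=1),
) -> bool:
    """Return True when no successful download happened within max_age.

    Assumption: a downloader that has never succeeded counts as stale.
    """
    if last_success is None:
        return True
    return now - last_success > max_age


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Individual failed attempts are ignored; only the age of the last
    # successful download decides whether to alert.
    print(is_stale(now - timedelta(hours=30), now))
```

Failed or empty attempts never raise an alert on their own here. Only the time elapsed since the last good download does, which matches the "failure only if it goes on for more than a day" rule described above.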


cuu508 1480 days ago

> "this service should run successfully at least once every X hours, but is fine to fail multiple times otherwise"

This should work if you set the period to "X hours", and send success signals only, no failure signals. In that case, as long as the gap between the success signals is below X hours, all is well. When there's been no success signal for more than X hours, Healthchecks sends out alerts.

I'm guessing you probably also want to log failures using the /fail endpoint. And, the problem is, when Healthchecks receives a failure event, it sends out alerts immediately.

One potential feature I'm considering is a new "/log" endpoint. When a client pings this endpoint, Healthchecks would treat it as neither a success nor a failure, and just log the received data. You could then use this endpoint in place of /fail. Just logging the failure would not cause any immediate alerts. But the information would be there for inspection, when X hours passes with no success signals and you eventually do get alerted. How does that sound?


ydant 1477 days ago

Thank you for the response!

I saw you make that suggestion on this issue - https://github.com/healthchecks/healthchecks/issues/525#issu...

----

Thinking about it, this does solve the issue as I described it. I do like being able to distinguish the states:

  - started, but never finished (no error reported)
  - started, and finished with error reported ("crash") (need immediate alert)
  - finished (without crashing), but not 100% successful (data not fetched)
  - finished successfully

As you mention, it makes sense to have the alerts be:

  - no successful completion (regardless of number of attempts) within X time
  - explicit error occurred

I think your /log approach does have the advantage of allowing for still having an explicit error alert regardless of duration - a critical error "alert NOW!" state.
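To make the state-to-endpoint mapping concrete, here is a small Python sketch. It assumes the "/log" endpoint behaves as proposed earlier in this thread (recorded, but treated as neither success nor failure). The `Outcome` enum, the `endpoint_for` helper and the base URL are hypothetical names for illustration.

```python
from enum import Enum


class Outcome(Enum):
    CRASHED = "crashed"    # finished with an error: alert immediately
    NO_DATA = "no_data"    # finished cleanly, but nothing was fetched
    SUCCESS = "success"    # finished and data was fetched


# "Started, but never finished" needs no entry here: the client pings
# /start and then nothing else, so the missing follow-up is the signal.
SUFFIX_FOR_OUTCOME = {
    Outcome.CRASHED: "fail",  # explicit error, alerts right away
    Outcome.NO_DATA: "log",   # assumed: the proposed /log endpoint
    Outcome.SUCCESS: "",      # plain ping counts as success
}


def endpoint_for(base: str, outcome: Outcome) -> str:
    """Return the URL to ping for a given outcome of one run."""
    suffix = SUFFIX_FOR_OUTCOME[outcome]
    base = base.rstrip("/")
    return f"{base}/{suffix}" if suffix else base
```

With the check's period set to X hours, only `SUCCESS` pings reset the clock, `NO_DATA` runs are merely recorded, and `CRASHED` runs still trigger the "alert NOW!" path.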

The only (weak) argument I see against this approach (and it is an argument for making this a configuration option) is that I started using HealthChecks.io because it's incredibly simple to set up for a cron job. Moving this logic to the client means slightly more complicated error handling, since the client has to call the right endpoint for each type of failure.

The counter-argument is that by the time you move from calling just "/success" to calling multiple endpoints, you're already in that position of more complicated client-side logic. If you want the simple "just run at least once every X hours" approach, then all you need to do is never call "/fail" and set the grace period appropriately.

For our use case, the logic for when to alert got much more complicated than what I described, so moving the rules into our own code still made sense, but I think there are some other instances where we'd benefit from your proposal.


amzans 1481 days ago

Healthchecks is a great service!

Not sure if you tried it too but https://cronitor.io/ supports more complex alerting rules like the one you describe.

As a bonus, you can also create uptime checks and status pages under the same roof.

Full disclosure: I work at Cronitor, happy to help if you have any questions :)
