by cuu508 1437 days ago
> "this service should run successfully at least once every X hours, but is fine to fail multiple times otherwise"

This should work if you set the period to "X hours", and send success signals only, no failure signals. In that case, as long as the gap between the success signals is below X hours, all is well. When there's been no success signal for more than X hours, Healthchecks sends out alerts.
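A rough sketch of that success-only setup, in case it helps. The ping URL is a placeholder, and `run_and_report` and `send_ping` are made-up names for illustration, not anything Healthchecks provides:

```python
import urllib.request

# Placeholder: substitute the ping URL of your own check.
PING_URL = "https://hc-ping.com/your-uuid-here"


def send_ping(url, timeout=10):
    """Send a plain GET request to the ping URL (the success signal)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()


def run_and_report(job, ping=send_ping, url=PING_URL):
    """Run job(); send a success signal only if it did not raise.

    On failure nothing is sent, so an alert only happens once the
    configured period passes without any success signal.
    """
    try:
        job()
    except Exception:
        # Deliberately no failure signal here: a failure signal would
        # trigger an immediate alert.
        return False
    ping(url)
    return True
```

With the period set to X hours, any number of failed runs in between stays silent as long as one run succeeds within each X-hour window.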

I'm guessing you probably also want to log failures using the /fail endpoint. And, the problem is, when Healthchecks receives a failure event, it sends out alerts immediately.

One potential feature I'm considering is a new "/log" endpoint. When a client pings this endpoint, Healthchecks would treat it as neither a success nor a failure, and just log the received data. You could then use this endpoint in place of /fail. Just logging the failure would not cause any immediate alerts. But the information would be there for inspection, when X hours passes with no success signals and you eventually do get alerted. How does that sound?
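From the client side, using the proposed endpoint might look roughly like this. It assumes "/log" would be appended to the ping URL the same way /fail is and would accept the failure details in the request body; the helper name and the URL are placeholders:

```python
import urllib.request


def make_log_request(base_url, message):
    """Build a POST request to the proposed /log endpoint.

    Assumption: /log hangs off the ping URL like /fail does, and the
    request body is stored with the check for later inspection.
    """
    url = base_url.rstrip("/") + "/log"
    return urllib.request.Request(
        url, data=message.encode("utf-8"), method="POST"
    )


# Usage (the UUID is a placeholder):
# req = make_log_request("https://hc-ping.com/your-uuid-here", "data not fetched")
# urllib.request.urlopen(req, timeout=10)
```

The client would call this where it calls /fail today, so the failure details are recorded without triggering an alert.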

1 comments

Thank you for the response!

I saw you make that suggestion on this issue - https://github.com/healthchecks/healthchecks/issues/525#issu...

----

Thinking about it, this does solve the issue as I described it. I do like being able to distinguish the states:

  - started, but never finished (no error reported)
  - started, and finished with error reported ("crash") (need immediate alert)
  - finished (without crashing), but not 100% successful (data not fetched)
  - finished successfully

As you mention, it makes sense to have the alerts be:

  - no successful completion (regardless of number of attempts) within X time
  - explicit error occurred

I think your /log approach has the advantage of still allowing an explicit error alert regardless of duration: a critical "alert NOW!" state.
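To make that concrete, here is roughly how I picture the client mapping the states above onto endpoints. The `Outcome` names and the convention that the job returns True only when fully successful are my own assumptions, and "/log" is the proposed endpoint:

```python
from enum import Enum


class Outcome(Enum):
    CRASHED = "crashed"    # finished with an error: alert immediately
    PARTIAL = "partial"    # finished, but e.g. data not fetched
    SUCCESS = "success"    # finished successfully


# URL suffix per outcome; "/log" is the proposed endpoint.
SUFFIX = {
    Outcome.CRASHED: "/fail",
    Outcome.PARTIAL: "/log",
    Outcome.SUCCESS: "",
}


def classify(job):
    """Run job() and classify the result.

    Assumed convention: job() returns True when fully successful,
    False when it finished without the desired result, and raises
    on a crash.
    """
    try:
        complete = job()
    except Exception:
        return Outcome.CRASHED
    return Outcome.SUCCESS if complete else Outcome.PARTIAL


def ping_url_for(base_url, outcome):
    """Return the URL to ping for the given outcome."""
    return base_url.rstrip("/") + SUFFIX[outcome]
```

The "started, but never finished" state sends nothing at all, so it is covered by the "no successful completion within X time" alert.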

The only (weak) argument I see against this approach, and it is an argument for making this a configuration option, is that I started using HealthChecks.io because it's incredibly simple to set up for a cron job. Moving this logic to the client means slightly more complicated error handling to call the right endpoint for each type of failure.

The counter-argument is that by the time you move from calling just "/success" to calling multiple endpoints, you're already in that position of more complicated client-side logic. If you want the simple "just run at least once every X hours" approach, then all you need to do is never call "/fail" and set the grace period appropriately.

For our use-case, our logic for when to alert (or not) got much more complicated than described, so moving the rules into our code still made sense, but I think there are some other instances where we'd benefit from your proposal.