by compumike 1159 days ago
Anyone who has done real engineering would realize that this problem is a consequence of a design flaw in PagerDuty (or any alternative with a similar API, where alerting is only triggered directly by a webhook).

If your design requires that the alerting service can receive a one-off affirmative "something's broken" packet, then yes, you are asking an inherently unreliable distributed system (i.e. the Internet!) to reliably deliver a critical message at a time when you know something is broken. Good luck. :)

Instead, if you use something like a periodic heartbeat (also known as a dead man's switch, inbound liveness monitor, or outbound HTTP probe -- all of which we support at Heii On-Call https://heiioncall.com/ out of the box), you can tolerate some occasional lost messages, regardless of whose end they are on.

Real reliable systems (for example, embedded systems) use periodic heartbeats and watchdogs, and are usually designed to be lenient to the occasional missed heartbeat. If the system being monitored is truly down, then enough consecutive heartbeats will be missed that some threshold is reached and the on-call person can be alerted (or a watchdog timer can reboot a system, etc).
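
To make the threshold idea concrete, here is a minimal sketch in Python. All names (HeartbeatMonitor, interval_s, missed_threshold) are made up for illustration and are not any particular service's API. It assumes the monitored system calls beat() on a fixed interval and that something else polls should_alert() periodically.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Hypothetical watchdog: alert only after several consecutive misses."""

    interval_s: float         # expected time between heartbeats (assumed fixed)
    missed_threshold: int     # consecutive misses required before alerting
    last_seen_s: float = 0.0  # timestamp of the most recent heartbeat

    def beat(self, now_s: float) -> None:
        # Any heartbeat that arrives resets the count of missed intervals.
        self.last_seen_s = now_s

    def missed(self, now_s: float) -> int:
        # Number of whole intervals that have elapsed with no heartbeat.
        elapsed = now_s - self.last_seen_s
        return max(0, int(elapsed // self.interval_s))

    def should_alert(self, now_s: float) -> bool:
        # One or two lost packets stay below the threshold; a real outage does not.
        return self.missed(now_s) >= self.missed_threshold


monitor = HeartbeatMonitor(interval_s=60, missed_threshold=3)
monitor.beat(0)
print(monitor.should_alert(130))  # two intervals missed: tolerated
print(monitor.should_alert(180))  # three intervals missed: page someone
```

The point of the design is that a silent monitored system is itself the signal, so the alert does not depend on any single message getting through while things are broken.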

1 comment

Also, the system at Google is not in the path of the first page (that comes directly from the alert infra); the more complex system is only needed for escalation.