by ChrisInEdmonton
1159 days ago
I used PagerDuty for more than a decade at my previous job. I didn't care much for the UI. But you know why PagerDuty does so well? Basically bulletproof reliability. 99% uptime won't cut it. 99.9% uptime won't cut it. You need to be as close as possible to 100% uptime, no excuses. PagerDuty isn't perfect, but it was one of the most reliable services we ever used.
I sincerely wish you luck with allquiet. I just want to make very sure you are aware why people still pay for PagerDuty. To compete, you need to be looking at 99.99% uptime or better (ideally 99.999%, about 5 minutes of downtime a year), where 'uptime' is defined as the ability to exercise the entire notification stack. The moment someone's site has an outage and you aren't able to deliver the notification, you lose the customer and everyone they talk to.
I also worry about in-app notifications, but that's well-covered by everyone else's comments.
PagerDuty is vulnerable. Their UI is garbage. But you need to have bulletproof uptime to take them down. It's a tough challenge and I wish you luck!
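For a sense of scale, the downtime budgets behind those percentages can be worked out in a few lines. This is only a rough sketch: it assumes an average year of 365.25 days, and the function name is made up for illustration.

```python
# Assumption: an average year of 365.25 days.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an uptime percentage."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


# Print the budget for each number of "nines" mentioned in the comment.
for pct in (99.0, 99.9, 99.99, 99.999):
    budget = downtime_minutes_per_year(pct)
    print(f"{pct}% uptime allows about {budget:.1f} minutes of downtime a year")
```

Under that assumption, 99% works out to several days a year, while 99.999% leaves a little over five minutes.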
|
If your design requires that the alerting service can receive a one-off affirmative "something's broken" packet, then yes, you are asking an inherently unreliable distributed system (i.e. the Internet!) to reliably deliver a critical message at a time when you know something is broken. Good luck. :)
Instead, if you use something like a periodic heartbeat (also known as a dead man's switch, inbound liveness monitor, or outbound HTTP probe -- all of which we support at Heii On-Call https://heiioncall.com/ out of the box), you can tolerate some occasional lost messages, regardless of whose end they are on.
Truly reliable systems (for example, embedded systems) use periodic heartbeats and watchdogs, and are usually designed to tolerate the occasional missed heartbeat. If the system being monitored is truly down, then enough consecutive heartbeats will be missed that some threshold is reached and the on-call person can be alerted (or a watchdog timer can reboot the system, etc.).
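The missed-heartbeat threshold described above can be sketched roughly as follows. This is a minimal illustration, not how Heii On-Call or any particular product implements it: the `HeartbeatMonitor` class, its method names, and the injectable `clock` parameter are all hypothetical.

```python
import time
from typing import Callable


class HeartbeatMonitor:
    """Hypothetical monitor that alerts only after several missed heartbeats."""

    def __init__(
        self,
        interval_s: float,
        missed_threshold: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.interval_s = interval_s              # expected time between heartbeats
        self.missed_threshold = missed_threshold  # missed beats tolerated before alerting
        self._clock = clock                       # injectable so the logic can be tested
        self._last_beat = clock()

    def beat(self) -> None:
        """Record that a heartbeat arrived; this resets the missed count."""
        self._last_beat = self._clock()

    def missed_beats(self) -> int:
        """Count whole heartbeat intervals elapsed since the last heartbeat."""
        elapsed = self._clock() - self._last_beat
        return int(elapsed // self.interval_s)

    def should_alert(self) -> bool:
        """True once enough consecutive heartbeats have been missed."""
        return self.missed_beats() >= self.missed_threshold


if __name__ == "__main__":
    # Demo with a fake clock: heartbeats expected every 60 s, alert after 3 misses.
    now = [0.0]
    monitor = HeartbeatMonitor(interval_s=60, missed_threshold=3, clock=lambda: now[0])
    now[0] = 70.0   # one heartbeat lost somewhere on the Internet
    print("alert after one miss:", monitor.should_alert())
    now[0] = 185.0  # three intervals with no heartbeat at all
    print("alert after three misses:", monitor.should_alert())
```

The point of the design is that a single dropped message never pages anyone; only sustained silence does, whichever end the messages were lost on.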