| HN Mirror


	by TeeWEE 889 days ago
	Connect your status page to actual metrics and decide on a threshold for downtime. Boom, you’re done.
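
For illustration, here is a minimal sketch of what that proposal might look like, assuming the metric is the share of 5xx responses in a recent window. The `Window` type, the `status_from_metrics` name, and the 5% default are hypothetical and not taken from any particular status page product.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts for one recent time window (hypothetical shape)."""
    total_requests: int
    server_errors: int  # number of 5xx responses


def status_from_metrics(window: Window, threshold: float = 0.05) -> str:
    """Return 'down' when the 5xx ratio exceeds the threshold, else 'up'."""
    if window.total_requests == 0:
        return "unknown"  # no traffic, so there is nothing to judge
    error_ratio = window.server_errors / window.total_requests
    return "down" if error_ratio > threshold else "up"
```

With these assumed defaults, `status_from_metrics(Window(1000, 100))` reports "down" and `status_from_metrics(Window(1000, 10))` reports "up". The replies below are mostly about why this simple rule misfires.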

2 comments

sjsdaiuasgdia 889 days ago

Stage 1: Status is manually set. There may be various metrics around what requires an update, and there may be one or more layers of approval needed.

Problems: Delayed or missed updates. Customers complain that you're not being honest about outages.

Stage 2: Status is automatically set based on the outcome of some monitoring check or functional test.

Problems: Any issue with the system that performs the "up or not?" source of truth test can result in a status change regardless of whether an actual problem exists. "Override automatic status updates" becomes one of the first steps performed during incident response, turning this into "status is manually set, but with extra steps". Customers complain that you're not being honest about outages and latency still sucks.

Stage 3: Status is automatically set based on a consensus of results from tests run from multiple points scattered across the public internet.

Problems: You now have a network of remote nodes to maintain yourself or pay someone else to maintain. The more reliable you want this monitoring to be, the more you need to spend. The cost justification discussions in an enterprise get harder as that cost rises. Meanwhile, many customers continue to say you're not being honest because they can't tell the difference between a local issue and an actual outage. Some customers might notice better alignment between the status page and their experience, but they're content, so they have little motivation to reach out and thank you for the honesty.
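
A rough sketch of the Stage 3 consensus idea, assuming each remote node reports a simple pass/fail result. The `consensus_status` name, the probe location labels, and the majority quorum are all hypothetical choices for illustration.

```python
from typing import Mapping


def consensus_status(probe_results: Mapping[str, bool], quorum: float = 0.5) -> str:
    """Decide status from many probes.

    probe_results maps a probe location name to whether its check passed.
    The service is reported 'down' only when more than `quorum` of the
    probes failed, so one broken probe cannot flip the status page.
    """
    if not probe_results:
        return "unknown"  # no probe reported, so no consensus is possible
    failures = sum(1 for passed in probe_results.values() if not passed)
    return "down" if failures / len(probe_results) > quorum else "up"
```

A single failing probe out of three, e.g. `{"us-east": True, "eu-west": False, "ap-south": True}`, leaves the status at "up", which is the protection against the Stage 2 failure mode. It does nothing for the cost problem described above.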

Eventually, the monitoring service gets axed because we can just manually update the status page after all.

Stage 4: Status is manually set. There may be various metrics around what requires an update, and there may be one or more layers of approval needed.

lrem 889 days ago

Does anyone serious do this?

That’s an honest question, from a pretty experienced SRE.

darkwater 889 days ago

In a world of unicorns and rainbows, absolutely. In the real world, it's as you probably already know: it's not that easy in a complex enough system.

Quick counter-example for GP: what if the 500 spike is due to a spike in malformed requests from a single (maybe malicious) user?

laeri 889 days ago

A malformed request should not lead to a 500, they should be handled and validated.

darkwater 889 days ago

Well, in the real world it might. It should trigger a bug creation and a fix to the code, but not an incident. Now all of a sudden to decide this you need more complex and/or specific queries in your monitoring system (or a good ML-based alert system), so complexity is already going up.
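
To make the extra complexity concrete, here is a sketch of one such "more specific query", assuming per-request records of (client id, HTTP status) are available. The `classify_error_spike` name, both thresholds, and the three labels are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def classify_error_spike(
    requests: Iterable[Tuple[str, int]],
    error_threshold: float = 0.05,
    single_client_share: float = 0.9,
) -> str:
    """Classify a window of (client_id, http_status) records.

    Returns 'ok' when the 5xx ratio is within the threshold, 'bug' when
    the spike comes almost entirely from one client, else 'incident'.
    """
    total = 0
    errors_by_client: Counter = Counter()
    for client_id, status in requests:
        total += 1
        if 500 <= status <= 599:
            errors_by_client[client_id] += 1

    errors = sum(errors_by_client.values())
    if total == 0 or errors / total <= error_threshold:
        return "ok"

    # How much of the spike is attributable to the noisiest single client?
    top_client_errors = errors_by_client.most_common(1)[0][1]
    if top_client_errors / errors >= single_client_share:
        return "bug"  # file a ticket, but do not touch the status page
    return "incident"
```

Even this small rule adds a second threshold to tune and requires client attribution in the metrics pipeline, which is the point being made about complexity.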

laeri 888 days ago

Query input validation is nearly a solved problem. If you don't validate and 500s are returned in this case, I would argue that is an incident.

jabradoodle 889 days ago

You need to validate your inputs and return 4xx
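
A framework-free sketch of that advice, assuming a JSON request body with a string `query` field. The `handle_request` function and the field name are hypothetical; only the standard library `json` module is used.

```python
import json
from typing import Tuple


def handle_request(raw_body: str) -> Tuple[int, str]:
    """Return (http_status, message), rejecting bad input with a 4xx."""
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError:
        # Malformed input is the client's fault, so answer 400 rather than 500.
        return 400, "body is not valid JSON"

    if not isinstance(payload, dict) or not isinstance(payload.get("query"), str):
        return 422, "field 'query' must be a string"

    return 200, f"searched for {payload['query']}"
```

With validation in front, malformed requests show up as 4xx and never reach a 5xx-based status check.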

darkwater 889 days ago

Yeah and you also shall not write bugs in your code. Real world has bugs, even trivial ones.

jon_adler 889 days ago

True, however it also doesn’t impact other users and doesn’t justify reporting an incident on the status page.

tazjin 889 days ago

https://www.buildkitestatus.com/