by anomaloustho
220 days ago
It’s already been said, but most companies already have those instant “alarms” that go off within minutes. 80% of the time, those alarms are red herrings that get triaged. At a lot of companies, they go off constantly.

As a company, you don’t want to declare an outage readily and you definitely don’t want it to be declared frequently. Declaring an outage frequently means:
• Telling your exec team that your department is not running well
• Negative signal to your investors
• Bad reputation with your customers
• Admitting culpability to your customers and partners (inviting lawsuits and refunds)
• Telling your engineering leadership team that your specific team isn’t running well
• Messing up your quarterly goals, bonuses, etcetera for outages that aren’t real

So every social and incentive structure along the way basically signals that you don’t want to declare an outage when it isn’t real. You want to make sure you get it right. Therefore, you don’t just want to flip a status page because a few API calls had a timeout.
|
I would argue that every social and incentive structure along the way basically signals that you don't want to declare an outage, even when it is real. You should still do it, though, or it becomes meaningless.
A great example of Goodhart's law.