by srhtftw
441 days ago
> Why are you getting paged? Because you built the system. There are at least two problems with this thinking. The main problem is it's not generally true. The system is created by the entire organization. The people who raise money and allocate capital, the people who set development policies and priorities, the people who design and assemble the components, the people who sell it to customers and negotiate service levels and the people who operate and maintain it all collectively built the system. Another problem is that it encourages moral hazards. Not paying fair on-call compensation allows unethical managers and sales staff to reap short-term rewards and bonuses by oversubscribing customers, promising more than can be delivered and rushing things to market before they're ready. If you want happy employees, treat them fairly.
In on-call threads I usually see people complaining that "I got paged by an alert because of another system X", but at least in a big enough organization this should not happen, and when it does it's an organizational failure. There should be an operations center staffed 24/7 that can triage, escalate and evaluate alerts, ideally not staffed only with L1 techs, and given enough freedom to actually improve and automate things. I know there are places where that is not true, and I ran away screaming from a few of them in my career once I understood that tech leadership had no idea why it was needed.
But you would be surprised how much of the on-call pain is actually self-inflicted by the application teams themselves. Some examples I encountered in the last year: TCP connect timeouts measured in minutes and with no retries; no retry policies in general, and operations that should be idempotent but are not; no circuit breaker strategies; connection pools churning because they're shared between 10+ remote endpoints; and wrong expectations about transaction isolation levels and how to handle conflicts, at least in simple scenarios.
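To make a few of those concrete, here is a rough Python sketch of a short connect timeout, a bounded retry with jittered backoff, and a very small circuit breaker. All names (`connect`, `retry`, `CircuitBreaker`) and all default values are made up for illustration, not taken from any particular library, and the retry wrapper assumes the wrapped call is idempotent.

```python
import random
import socket
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without trying."""


class CircuitBreaker:
    """Hypothetical minimal breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to stay open before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: fall through and allow one trial call (half-open).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # (re)open the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def retry(fn, attempts=3, base_delay=0.1, max_delay=2.0,
          sleep=time.sleep, retry_on=(OSError,)):
    """Call fn up to `attempts` times, sleeping with exponential backoff and jitter.

    Only safe for idempotent operations. CircuitOpenError is not an OSError,
    so a rejected call is not retried by default.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # "full jitter" backoff


def connect(host, port, timeout=2.0):
    """Open a TCP connection with a connect timeout in seconds, not minutes."""
    return socket.create_connection((host, port), timeout=timeout)
```

Usage would be something like `retry(lambda: breaker.call(connect, host, port))` with one `CircuitBreaker` per remote endpoint, so that one bad dependency fails fast instead of tying up callers. The thresholds and delays above are placeholders to tune per service, not recommendations.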