by Arainach
2 hours ago
You can't prove that "this work caused us to not get paged" versus "that work is unnecessary and you wouldn't have been paged regardless". Even when you can, you can't prove the impact. As a real example, our team has extensive presubmit infrastructure to catch and block some classes of configuration error that have caused customer data corruption in the past. There have been CLs which were caught by those presubmits and meant that we didn't have outages, but there's no dollar amount tied to an outage that didn't exist. Meanwhile, team X did something similar that caused data corruption, had N customers affected for such a period of time, scrambled to root cause, roll back, and restore from backups, getting customers back up and online. Look how responsive and great they are!

The impact is the number of outages overall. If you only prevent one outage, then maybe it's not that meaningful.
As for your last paragraph, you're right that that happens in the short term. In the long term, those teams get reputations for being a shit show: there will be high turnover, good engineers won't transfer in, people's competencies start to get questioned, other teams will avoid working with that team and develop their own solutions, and higher-ups will start to look at what's going on.