Ideally, incident handling should "just" be rolling back the broken change. Fixing the problem should be done in the morning with no time pressure, not in the middle of the night, half asleep, with customers on the other side of the world yelling at you. Of course it's not always that simple, but most of the time that's what on-call should be about.
It would be nice if things only broke during "business" hours and didn't have real-world impact, never mind impact on millions of people around the world. But look at the customers of, say, code that runs cloud infrastructure: airline reservations and check-ins, government workloads, banks, hospitals, critical infrastructure, Netflix, gaming services. That's a lot of things that typically can't wait for morning.
This is the pat answer Amazon gives to defend this absurd practice, but it breaks down really easily.
>If your code breaks something, you should fix that code. Who else should?
What if it wasn't my code, but code written by someone 3 years ago who quit because most people only work at the company for 2 years? And it's in a part of the codebase I've never touched. That's a much more likely scenario.
The problem is that he has a big pile of half-working spaghetti code that he never has time to touch except when it malfunctions in the middle of the night.
The problem behind that is that Amazon is a completely dysfunctional corporate hellscape. Like TFA said, you just don't have time or resources to actually fix things.
You should join Amazon and do that, and you can come back here and apologize in a couple of years when you get PIPed for wasting too much time on legacy code.