by yuliyp 694 days ago
> In my experience with outages, usually the problem lies in some human error not following the process

Everyone makes mistakes. Blaming them for making those mistakes doesn't help prevent mistakes in the future.

> what kind of bug? Could it have been prevented with proper testing or code review?

It doesn't matter what the exact details of the bug are. A validator that doesn't perfectly match the thing it's meant to protect is a failure mode in itself. They happened to trip that failure mode spectacularly.

Also, saying "proper testing and code review" in a post-mortem is useless like 95% of the time. Unless there's a culture of rubber-stamping and yolo-merging, in which case there is actually something to fix, it's a truism that any bug could have been detected with a test or caught by a diligent reviewer. But it could also have been missed (and was). "git gud" is not an incident prevention strategy; it's wishful thinking, or blaming the devs unlucky enough to break it.

More useful follow-ups are things like "this type of failure mode feels very dangerous; we can do something to make those failures impossible or much more likely to be caught."
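
To make "much more likely to be caught" concrete, here is a rough sketch of one such follow-up: a differential check that feeds the same samples to a validator and to the thing it guards, and flags every input the validator accepts but the consumer chokes on. Everything in it (`validate`, `consume`, `EXPECTED_FIELDS`, the sample records) is made up for illustration and is not taken from the incident under discussion.

```python
# Hypothetical validator/consumer pair; all names are illustrative assumptions.
EXPECTED_FIELDS = 3


def validate(record: list) -> bool:
    """Accept any non-empty record (deliberately looser than the consumer)."""
    return len(record) > 0


def consume(record: list) -> str:
    """Require exactly EXPECTED_FIELDS fields; raise otherwise."""
    if len(record) != EXPECTED_FIELDS:
        raise ValueError(f"expected {EXPECTED_FIELDS} fields, got {len(record)}")
    return "|".join(record)


def find_disagreements(samples):
    """Return the samples the validator accepts but the consumer rejects."""
    bad = []
    for sample in samples:
        if not validate(sample):
            continue  # the validator rejected it, so the consumer never sees it
        try:
            consume(sample)
        except Exception:
            bad.append(sample)  # validator said yes, consumer blew up
    return bad


samples = [[], ["a"], ["a", "b", "c"], ["a", "b", "c", "d"]]
disagreements = find_disagreements(samples)
```

Run in CI against a broad or generated set of inputs, a non-empty `disagreements` list points at the validator/consumer mismatch itself rather than at whoever happened to ship the change that triggered it.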

1 comment

> Everyone makes mistakes. Blaming them for making those mistakes doesn't help prevent mistakes in the future.

You can't reliably fix problems you don't understand.