by anonu
697 days ago
In my experience with outages, the problem usually lies in human error, in someone not following the process: someone didn't do something, checks weren't performed, code reviews were skipped, someone got lazy. This post-mortem has a lot of words, but not one of them actually explains what the problem was, namely: what process was in place, and why did it fail? They also mention a "bug in the content validation". What kind of bug? Could it have been prevented with proper testing or code review?
Everyone makes mistakes. Blaming them for making those mistakes doesn't help prevent mistakes in the future.
> what kind of bug? Could it have been prevented with proper testing or code review?
It doesn't matter what the exact details of the bug are. A validator that doesn't perfectly match the thing it's supposed to protect is a failure mode in its own right. They happened to trip that failure mode spectacularly.
Also, saying "proper testing and code review" in a post-mortem is useless like 95% of the time. Unless the culture is one of rubber-stamping and yolo-merging, in which case there is actually something to fix, it's a truism that any bug could have been detected with a test or caught by a diligent reviewer in code review. But it could also have been (and was) missed. "git gud" is not an incident prevention strategy; it's wishful thinking, or blaming the devs unlucky enough to break it.
More useful follow-ups are things like "this type of failure mode feels very dangerous, so we should do something to make those failures impossible or much more likely to be caught."
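To make "impossible" a bit more concrete, here is a rough sketch of one way to stop a validator and its consumer from drifting apart: have the release-time check call the exact same parser the consumer uses. Everything here (`parse_rule`, `validate`, `load`, the three-field format) is made up for illustration and is not taken from the post-mortem.

```python
from dataclasses import dataclass

EXPECTED_FIELDS = 3  # hypothetical: how many fields the consumer reads per rule


class InvalidContent(ValueError):
    """Raised when a content line cannot be parsed into a Rule."""


@dataclass(frozen=True)
class Rule:
    fields: tuple[str, ...]


def parse_rule(line: str) -> Rule:
    """The single parser shared by the release-time check and the consumer."""
    fields = tuple(part.strip() for part in line.split(","))
    if len(fields) != EXPECTED_FIELDS:
        raise InvalidContent(
            f"expected {EXPECTED_FIELDS} fields, got {len(fields)}"
        )
    if any(not field for field in fields):
        raise InvalidContent("empty field")
    return Rule(fields)


def validate(content: str) -> bool:
    """Release-time check: valid only if the consumer's parser accepts every line."""
    try:
        for line in content.splitlines():
            parse_rule(line)
    except InvalidContent:
        return False
    return True


def load(content: str) -> list[Rule]:
    """Consumer side: uses the same parse_rule, so it cannot disagree with validate."""
    return [parse_rule(line) for line in content.splitlines()]
```

With this shape, `validate("a,b")` returns `False` and `load("a,b")` raises `InvalidContent`, for the same reason. It doesn't cover every case (the consumer can still be running an older build than the validator), but it removes the "two separate implementations of the same rules" class of mismatch.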