CrowdStrike published a very extensive write-up about what went wrong, including how the tests passed, if you are interested, but I want to comment on your hypothetical series of events.

None of that responsibility falls under software "engineering" specifically; it falls under the broader scope of systems engineering. The problems you stated are about how different systems interacted in a failure case, not about how any individual system that any individual "engineer" worked on failed.

Is it just as much Microsoft's fault that repeated bluescreens from a failing kernel driver didn't prompt the OS to stop loading said driver and try to boot without it?

Is it the fault of the engineer who wrote the faulty code? Their EM? The PM who approved bypassing staging? Who is the one who should be investigated and fired? What if there are 100 people who touched the codebase in the last "sprint"?

This leads to accountability and liability: who should be held liable? That is literally the point of a chief engineer, who is held liable, financially if possible and criminally if proven. Who is the "chief engineer" in your #1 hypothetical for a company, and what are their qualifications and skill level? That's the real question, because we know the standards are not there. If you go and read the CrowdStrike report, you will find it was an out-of-bounds access, with the index passed in from another system. It's not statically verifiable, and bounds checking at runtime with a crash (à la Rust) would still have caused the crash. The only way to prevent it would be to place a manual bounds check before the call site, which has been best practice for decades and yet still isn't happening. So it's an accountability thing: someone did a code review and probably gave an LGTM because the array has bounds checking, which would catch an out-of-bounds read, but didn't consider the fact that it crashing would bring down the host.
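
To make the difference concrete, here is a rough Rust sketch of the two approaches. It is not CrowdStrike's code; the names (`Lookup`, `read_field_or_panic`, `read_field_checked`) and the `u32` field type are made up for illustration. The first function relies on the language's built-in bounds check, which turns a bad index into a panic. The second checks the externally supplied index before touching the array, so the caller can handle a bad value.

```rust
/// Hypothetical result type: either the field value or a rejected index.
#[derive(Debug, PartialEq)]
enum Lookup {
    Value(u32),
    Rejected { index: usize, len: usize },
}

/// Relies on the built-in runtime bounds check.
/// An out-of-range index panics: memory safe, but still a crash.
/// In a kernel driver, that kind of crash is what takes the host down.
fn read_field_or_panic(fields: &[u32], index: usize) -> u32 {
    fields[index]
}

/// Manual bounds check before the access.
/// A bad index from another system becomes a value the caller must handle.
fn read_field_checked(fields: &[u32], index: usize) -> Lookup {
    match fields.get(index) {
        Some(value) => Lookup::Value(*value),
        None => Lookup::Rejected {
            index,
            len: fields.len(),
        },
    }
}
```

With a three-element slice and an index of 20, `read_field_or_panic` panics, while `read_field_checked` returns `Lookup::Rejected { index: 20, len: 3 }`, so the caller can log it and skip the update instead of crashing.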