by ahepp
650 days ago
I can certainly see that there are some fundamental issues with assigning software engineers the level of responsibility that structural engineers have.
That said, from what I've heard, CrowdStrike seems like a great example of something a hypothetical licensed software engineer should lose their license for. I admit I don't know all the details, but it seems that an update was pushed to prod that immediately broke all Windows machines. Doesn't that mean they pushed an update to customers without testing it even a single time, on a single Windows machine? I heard they even bypassed customer staging environments?
I also find it interesting to consider what the future holds. A few possible paths:
1) The state of the profession progresses to the point where we have enough widely recognized best practices to make licensure meaningful.
2) We consider the benefits of rapid, cheap(er than the alternative) software production as being greater than the costs of CrowdStrike-level events, and change nothing.
3) We adapt software system architectures on the customer side so that there's meaningful oversight and accountability inside an organization (in many ways enabling #1).
|
None of that responsibility falls under software "engineering" specifically; it falls under the broader scope of systems engineering. The problem you describe is about how different systems interacted in a failure case, not about how any individual system that any individual "engineer" worked on failed.
Is it as much Microsoft's fault that repeated bluescreens from a failing kernel driver didn't prompt the OS to stop loading said driver and try to boot?
Is it the fault of the engineer who wrote the faulty code? Their EM? The PM who approved bypassing staging? Who is the one who should be investigated and fired? What if 100 people touched the codebase in the last "sprint"?
This leads to accountability and liability: who should be held liable? That is literally the point of a chief engineer, who is held liable, financially if possible and criminally if proven. Who is the "chief engineer" in your #1 hypothetical for a company, and what are their qualifications and skill level? That's the real question, because we know the standards are not there. If you go and read the CrowdStrike report, you will find it was an out-of-bounds access, with the index passed in from another system. It's not statically verifiable, and bounds checking at runtime with a crash (à la Rust) would still have caused the crash. The only way to prevent that would be to place a manual bounds check before the call site, which has been best practice for decades and yet still isn't happening. So it's an accountability thing: someone did a code review and probably gave an LGTM because the array has bounds checking, which would catch an out-of-bounds read, but didn't consider the fact that the resulting crash would bring down the host.
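To make the distinction concrete, here is a minimal Rust sketch of the two approaches. It is only an illustration, not CrowdStrike's code: the `FIELDS` table and both function names are hypothetical. The first function leans on the language's built-in bounds check, which is memory-safe but still panics on a bad index. The second checks the index before the access and hands the caller an error to deal with.

```rust
/// Hypothetical lookup table, standing in for whatever data the driver indexed into.
const FIELDS: [&str; 3] = ["alpha", "beta", "gamma"];

/// Relies on the built-in bounds check. An out-of-range index cannot read
/// stray memory, but it panics, and a crash in kernel context still takes the host down.
fn read_with_builtin_check(index: usize) -> &'static str {
    FIELDS[index]
}

/// Manual bounds check before the access. An out-of-range index becomes an
/// ordinary error value that the caller has to handle instead of a crash.
fn read_with_manual_check(index: usize) -> Result<&'static str, String> {
    match FIELDS.get(index) {
        Some(field) => Ok(*field),
        None => Err(format!(
            "index {} out of range for {} fields",
            index,
            FIELDS.len()
        )),
    }
}
```

With an index that arrives from another system, `read_with_manual_check(20)` returns an `Err` that can be logged and skipped, while `read_with_builtin_check(20)` panics. A reviewer who sees "the array has bounds checking" is looking at the first behavior, not the second.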