by starttoaster 648 days ago
The thing is, you can hold an engineer that designed a bridge accountable for designing a poor bridge that failed under load. The strength of materials, the load bearing ratings of a particular design for a bridge, can all be calculated and are well known. How do you come up with the same types of calculations for software? You really can't, because writing software is more about designing a solution to problems people aren't already solving. You can write tests for the problems that you are skilled enough to anticipate, or even tests to cover problems you discovered in hindsight of a regression, and sure you can hold software engineers accountable for choosing to bypass the tests they write. But how do you hold a software engineer accountable for a failure mode that nobody considered already? It's like holding the very first person to design a bridge accountable for making a bad bridge, let's see you make a better one.

For that reason, I don't think we'll see software engineers accepting the same level of responsibility for their errors as structural engineers.

5 comments

I can certainly see that there are some fundamental issues with assigning software engineers the level of responsibility that structural engineers have.

That said, from what I've heard, CrowdStrike seems like a great example of something a hypothetical licensed software engineer should lose their license for. I admit I don't know all the details, but it seems that an update was pushed to prod that immediately broke all Windows machines. Doesn't that mean they pushed an update to customers without testing it even a single time, on a single Windows machine? I heard they even bypassed customer staging environments?

I also find it interesting to consider what the future holds. A few possible paths seem like:

1) The state of the profession progresses to the point where we have enough widely recognized best practices to make licensure meaningful

2) We consider the benefits of rapid, cheap(er than the alternative) software production as being greater than the costs of CrowdStrike-level events, and change nothing

3) We adapt software system architectures on the customer side so that there's meaningful oversight and accountability inside an organization (in many ways enabling #1)

There was a very, very extensive write-up from CrowdStrike about what went wrong, including how tests were passed, if you are interested. But I want to comment on your hypothetical series of events.

None of that responsibility falls under software "engineering" specifically; it falls under the broader scope of systems engineering. The problems you stated are about how different systems interacted in a failure case, not about how any individual system that any individual "engineer" worked on failed.

Is it as much Microsoft's fault that repeated bluescreens from a failing kernel driver didn't prompt the OS to stop loading said driver and try to boot?

Is it the fault of the engineer who wrote the faulty code? Their EM? The PM who approved bypassing staging? Who is the one who should be investigated and fired? What if there are 100 people who touched the codebase in the last "sprint"?

This leads to accountability and liability: who should be held liable? That is literally the point of a chief engineer: he is held liable, financially if possible and criminally if proven. Who is the "chief engineer" in your #1 hypothetical for a company, and what are their qualifications and skill level? That's the real question, because we know the standards are not there. If you go and read the CrowdStrike report, you will find it was an out-of-bounds access, with the index passed in from another system. It's not statically verifiable, and bounds checking at runtime with a crash (à la Rust) would have still caused the crash. The only way to prevent that would be to place a manual bounds check before the call site, which has been best practice for decades and yet still isn't happening. So it's an accountability thing: someone did a code review and probably gave an LGTM because the array has bounds checking, which would catch an out-of-bounds read, but didn't consider the fact that it crashing would bring down the host.
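To make the distinction concrete, here is a minimal Python sketch of the two approaches, not CrowdStrike's actual code. The function names `read_field_unchecked` and `read_field_checked` and the 20-element list are hypothetical, and Python's `IndexError` stands in for a runtime bounds-check crash or panic.

```python
from typing import Optional, Sequence


def read_field_unchecked(fields: Sequence[str], index: int) -> str:
    # Relies only on the language's built-in runtime bounds check.
    # An index past the end raises IndexError, which takes the caller
    # down with it unless something further up handles it.
    return fields[index]


def read_field_checked(fields: Sequence[str], index: int) -> Optional[str]:
    # Manual bounds check before the access: an untrusted index coming
    # from another system is rejected instead of crashing the caller.
    if not 0 <= index < len(fields):
        return None
    return fields[index]


if __name__ == "__main__":
    fields = ["value"] * 20   # hypothetical: 20 fields supplied
    bad_index = 20            # hypothetical: index from another system, one past the end

    print(read_field_checked(fields, bad_index))  # None: caller can skip the bad input

    try:
        read_field_unchecked(fields, bad_index)
    except IndexError as exc:
        # In a kernel driver there is no friendly handler like this one;
        # the equivalent failure brings down the host.
        print("crashed:", exc)
```

Both versions are memory safe. The difference is only in what happens after the bad index is detected, which is the part a reviewer has to think about.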

> Is it as much Microsoft's fault that repeated bluescreens from a failing kernel driver didn't prompt the OS to stop loading said driver and try to boot?

Nit. Windows does have something that does this. Failing kernel drivers are excluded on reboot. But CrowdStrike marked the Falcon(?) driver in some way that prevents booting without it, even in safe mode. After all, being able to force a boot without your antivirus system isn’t safe, so why allow it?

Counter nit. It is a WHQL approved driver. Microsoft validated it to do that.

It's all hypothetical regardless. The point is that there are so many people involved in that specific failure, and if they really wanted to investigate it they would likely find that some best practice was followed and the failure occurred anyway.

"Real" engineers do new things too, where there may be un-anticipated failure modes, and where the answer can't just be looked up from a book of standards. Things like boring the world's longest tunnel under a mountain with sparsely available geological samples, building passenger trains that beat the world speed record, building reusable space rockets, and so on. Software engineers aren't the only ones solving novel and complex problems, and failure is sometimes understandable.

You don't lose your license if you fail at solving a hard problem, where nobody has succeeded before. But to be granted the responsibility to attempt those problems you have to demonstrate experience, education, and competence. You lose your license if you demonstrate disregard for ethics, basic safety standards, recordkeeping, and so on. I don't see why software engineering can't, in principle, have a similar level of professionalism, especially when critical systems are being built on top of it. But it would strongly conflict with the ethos of anarchy, moving-fast-and-breaking-things, and autodidact garage hacker culture that permeates the field (and which software engineering has greatly benefitted from).

In practice, engineers are not sent to jail only if the bridge they designed falls. They are sent to jail or punished otherwise if it is shown in a court or similar investigative process that they intentionally or unintentionally failed to follow standards or reasonable foreseeable precautions.

The same can very easily exist, and must exist, for software engineers. Some standards body comes up with a set of safety and security standards, and all the licensed engineers ensure the software under their wing conforms to those standards.

Software engineers will accept this responsibility when eventually governments pass laws regarding this, as they have done for other professions. It just takes time.

Software is simply too complex. The complexity is mitigated with abstraction, but the abstracted code quickly becomes complex again, and it once again gets abstracted.

Layer this a bunch of times and you get to where the regular SWE is working.

However, that doesn't mean we cannot have bug-free code. It just means that code would have to be written close to hardware, likely by a team fluent in both hardware execution and software architecture.

Using phased rollouts of updates has been a thing for well over a decade at this point. Microsoft uses phased rollouts for Windows updates. Google does the same for Chrome. To say nothing of proper testing and fuzzing.
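For anyone who hasn't seen one, a phased rollout can be as simple as the following Python sketch. This is one common approach under stated assumptions, not how Microsoft, Google, or CrowdStrike actually implement it. The names `rollout_bucket` and `should_receive_update`, the device ID format, and the update ID are all hypothetical.

```python
import hashlib


def rollout_bucket(device_id: str, update_id: str) -> int:
    # Hash the device and update together so each device lands in a
    # stable bucket from 0 to 99 for a given update, and in a different
    # bucket for the next update.
    digest = hashlib.sha256(f"{update_id}:{device_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def should_receive_update(device_id: str, update_id: str, percent: int) -> bool:
    # A device gets the update once the rollout percentage passes its bucket.
    # Raising `percent` only ever adds devices; it never removes any.
    return rollout_bucket(device_id, update_id) < percent


if __name__ == "__main__":
    devices = [f"device-{n}" for n in range(1000)]  # hypothetical fleet
    for percent in (1, 10, 50, 100):                # hypothetical ramp stages
        count = sum(should_receive_update(d, "update-A", percent) for d in devices)
        print(f"{percent:3d}% stage -> {count} of {len(devices)} devices")
```

The point of the ramp is that someone (or some automated health check) looks at crash reports between stages and stops raising the percentage if the first 1% of machines stop booting.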

It’s not a mystery why CrowdStrike brought down so many companies and got people killed. Their engineering practices were foreseeably bad and people died as a result. Wah wah software is complicated. So what? That’s why you learn to do it properly before you install your software in hospitals and airports.

Learn to take responsibility for the outcome of your work. Software is complex, yes. Keeping that complexity in check is what software engineers are trained for and hired for.