by SkyPuncher 699 days ago
My understanding is they basically deployed a configuration file. It seems like these files might be akin to virus signatures or other frequently updated run-time configuration.

I actually don't think it's outrageous that these files are rolled out globally, simultaneously. I'm guessing they're updated frequently and _should_ be largely benign.

What stands out to me is the fact that a bad config file can crash the system. No rollback mechanism. No safety checks. No safe failure mode. Just BSOD.

Given the fix is simply deleting the broken file, it's astounding to me that the system's behavior is BSOD. To me, that's more damning than a bad "software update". These files seem to change frequently. Given they're on the critical path, they shouldn't have the ability to completely crash the system.

3 comments

> I actually don't think it's outrageous that these files are rolled out globally, simultaneously.

Anyone competent who manages software at scale should generally hold the opposite opinion.
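
For what it's worth, the usual alternative to a simultaneous global push is a staged rollout. Here's a minimal sketch of the idea, not anything the vendor is known to do. The stage percentages, the crash-rate threshold, and the function names are all made up for illustration.

```python
import hashlib

# Hypothetical rollout stages, as a percentage of the fleet.
STAGES = [1, 10, 50, 100]


def bucket(host_id: str) -> int:
    """Map a host to a stable bucket in 0..99 by hashing its ID."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def should_receive(host_id: str, stage_percent: int) -> bool:
    """A host gets the update once the rollout reaches its bucket."""
    return bucket(host_id) < stage_percent


def next_stage(current, crashed, updated, max_crash_rate=0.001):
    """Return the next rollout percentage, or None to halt the rollout.

    The crash-rate threshold is an assumed policy, not a recommendation.
    """
    if updated and crashed / updated > max_crash_rate:
        return None
    later = [s for s in STAGES if s > current]
    return later[0] if later else current
```

Because the bucket is derived from the host ID, a host that is in the 1% cohort stays in every later cohort, so widening the rollout never flips a machine back to the old file.
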

That’s the danger of running in kernel mode. I’ve seen some people claim this is because the bad file starts a chain of events which concludes in trying to page an unpageable file, which is an application crash in user space but brings down the whole system if it happens in the kernel.

That seems like programming 101 for these systems.

In the past, I've worked around this by validating a configuration file before attempting to run it. You bail out in a safe way during validation, but still allow a hard error during run time.

It doesn't prevent all misconfigured files, but it prevents stuff like this.
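
A rough sketch of that validate-first pattern, assuming a JSON config with two required fields. The schema and the names `validate_config` and `load_or_keep` are hypothetical, and the real files are certainly not JSON.

```python
import json

# Hypothetical schema: field name -> required type.
REQUIRED_FIELDS = {"version": int, "rules": list}


class ConfigError(Exception):
    """Raised when a config file fails validation."""


def validate_config(raw: bytes) -> dict:
    """Parse and check a config, raising ConfigError on any problem."""
    try:
        config = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        raise ConfigError(f"unparseable config: {exc}") from exc
    if not isinstance(config, dict):
        raise ConfigError("top-level value must be an object")
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in config:
            raise ConfigError(f"missing field: {name}")
        if not isinstance(config[name], expected_type):
            raise ConfigError(f"wrong type for field: {name}")
    return config


def load_or_keep(raw: bytes, current: dict) -> dict:
    """Use the new config if it validates, otherwise keep the current one."""
    try:
        return validate_config(raw)
    except ConfigError:
        # Safe bail-out: reject the file and keep running on the old config.
        return current
```

Anything that gets past validation can still blow up at run time, which is the hard-error half of the approach.
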

I think it was in the early 90s when I first saw something do A/B style loading where it would record the attempt to load something, recognize that it hadn’t finished, and use the last known good config instead. Anyone studying high-availability systems has a wealth of prior art to learn from.
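
Something like the following captures that record-the-attempt idea. It's a sketch under assumptions: the file names (`candidate.cfg`, `last_good.cfg`, `loading.marker`) and the `parse` callback are invented for illustration, and a real implementation would also need to flush the marker to disk before loading.

```python
import os
import tempfile


def load_with_fallback(directory, parse):
    """Try a candidate config, falling back to the last known good one.

    Hypothetical layout inside `directory`:
      candidate.cfg   newly delivered config, not yet trusted
      last_good.cfg   config that loaded successfully before
      loading.marker  exists only while a candidate load is in progress
    """
    candidate = os.path.join(directory, "candidate.cfg")
    last_good = os.path.join(directory, "last_good.cfg")
    marker = os.path.join(directory, "loading.marker")

    # A leftover marker means the previous attempt never finished
    # (for example, the machine crashed), so distrust the candidate.
    if os.path.exists(marker):
        if os.path.exists(candidate):
            os.remove(candidate)
        os.remove(marker)

    if os.path.exists(candidate):
        # Record the attempt before touching the candidate.
        with open(marker, "w") as f:
            f.write("loading candidate\n")
        try:
            with open(candidate, "rb") as f:
                config = parse(f.read())
        except Exception:
            # Soft failure: the parser rejected the file.
            os.remove(candidate)
            os.remove(marker)
        else:
            # Success: promote the candidate to last known good.
            os.replace(candidate, last_good)
            os.remove(marker)
            return config

    with open(last_good, "rb") as f:
        return parse(f.read())
```

The marker is what covers the hard-crash case: if loading the candidate takes the whole machine down, the next boot sees the marker and skips the file instead of looping.
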
I think all programmers should have the experience of using and developing on a single-address-space OS with absolutely no protections like DOS, just to encourage them to improve their skills at writing better, actually correct code. When the smallest bugs will crash your system and cause you to lose work, you tend to be a lot more careful with thinking about what your code does instead of just running it to see what happens.

Suggesting “Being more careful” never solves these issues because eventually someone somewhere will have a momentary slip up that causes this.

The real takeaway is that we need to design systems so this kind of issue is less possible. Put less code in the kernel, use tools that prevent these kinds of issues, design computers that can roll back the system if they crash.

Perfect example of where instrumentation-guided fuzzing like AFL would almost certainly have found a problem.
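
To give a flavour of it, here is a deliberately naive mutation fuzzer. It is not coverage-guided the way AFL is; it just throws mutated inputs at a parser and records anything that isn't a clean rejection. The toy parser, its planted bug, and every name here are invented for illustration.

```python
import random


def parse_record(data: bytes) -> bytes:
    """Toy parser: [length][payload...][checksum].

    Deliberate bug: it trusts the length byte without checking it
    against the actual buffer size.
    """
    if not data:
        raise ValueError("empty input")
    length = data[0]
    payload = data[1:1 + length]
    checksum = data[1 + length]  # IndexError if the length overruns the buffer
    if sum(payload) % 256 != checksum:
        raise ValueError("bad checksum")
    return payload


def mutate(seed: bytes, rng: random.Random) -> bytes:
    """Apply one to three random edits: overwrite, truncate, or insert."""
    data = bytearray(seed)
    for _ in range(rng.randint(1, 3)):
        choice = rng.randrange(3)
        if choice == 0 and data:
            data[rng.randrange(len(data))] = rng.randrange(256)
        elif choice == 1 and data:
            del data[rng.randrange(len(data)):]
        else:
            data.insert(rng.randrange(len(data) + 1), rng.randrange(256))
    return bytes(data)


def fuzz(target, seed: bytes, iterations: int = 2000, rng_seed: int = 0):
    """Return (input, exception) pairs for every unexpected failure."""
    rng = random.Random(rng_seed)
    crashes = []
    for _ in range(iterations):
        sample = mutate(seed, rng)
        try:
            target(sample)
        except ValueError:
            pass  # clean rejection of bad input is the desired behaviour
        except Exception as exc:
            crashes.append((sample, exc))
    return crashes


# A valid record to mutate from: length 3, payload "abc", checksum byte.
SEED = b"\x03abc" + bytes([sum(b"abc") % 256])
```

Calling `fuzz(parse_record, SEED)` turns up truncated inputs that hit the unchecked index. A real tool would use coverage feedback to reach deeper paths, but even this much catches a parser that trusts a length field.
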

I agree with the amateur hour observation. But then most things seem to be.

Entertainingly enough I got to see a similar thing happen, where a configuration file was killing hardware in the field. After the failure and remediation, multiple CI jobs were put in place (some months later) to do basic validity checks on the files.

The lessons were "multiple parser implementations for the same thing seems bad" and "sanity checks to prevent breaking things are hard heuristics to define", so further changes were deferred.

All that to say that I can appreciate circumstances in which satisfying "don't crash the system" in response to configuration data can actually be fairly hard to realize. It can depend very significantly on the design of the pieces in question. But I also agree that it's pretty damning.