by jerf 3323 days ago

Imagine that you had an oracle that you could feed code, and it would hand you back all possible bugs and the probability of each bug manifesting in the real world. A "bug" here is some arbitrarily long execution trace of events that results in an undesirable outcome, probably a crash. The "probability" here is drawn from the distribution of "real world events". Along with magically finding all bugs, that is another way the oracle is magic, since you don't normally have access to that distribution.

First, of course, you turn yourself into a multi-billionaire. But once you're bored with that, let's feed it the code to a device driver. Assuming that this is a driver that has had some decent work put into it and that you aren't getting something degenerate like a 100% chance of hitting a certain bug, you probably get back a list of bugs that almost certainly follows a power law distribution. If it's a good driver, the most likely bug may only be one in a trillion, but after that first bug the list probably trails off for quite some time. For instance, you may find a bug that only applies once the (2^64-1)th byte comes in AND you get unlucky on a scheduling operation or something. There are often a lot of these things even in incredibly well-tested code, such as: http://envisage-project.eu/proving-android-java-and-python-s...

Now suppose we characterize the challenge of writing a bit of code as A: the difficulty of pushing the probability of the bugs down, B: the impact of failure, and C: how many rolls of the dice the real world takes at the code. Easy-to-write code is code where it is easy to push the probability of failure below what I care about on the real-world instances that I care about, and where the impact of failure is not that big a deal. So the easiest code to write by this metric would be something like a one-off string massaging script written in bash where I'm going to examine the results by eye and fix them up anyhow. It's easy to get it right enough, failure just means I'm going to manually fix it up a bit, and the world only takes one shot at my code anyhow.
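
To put rough shape on the "rolls of the dice" idea, here is a minimal sketch in Python. It assumes each roll is independent and has the same per-roll bug probability, which real systems won't honor exactly, and the function names and the numbers are made up for illustration.

```python
import math

def p_any_failure(p_per_roll: float, rolls: float) -> float:
    """Chance a bug shows up at least once across `rolls` independent rolls.

    Assumption: every roll is independent with the same probability.
    """
    if p_per_roll >= 1.0:
        return 1.0
    # 1 - (1 - p)^n, via log1p/expm1 so that a tiny p doesn't round away.
    return -math.expm1(rolls * math.log1p(-p_per_roll))

def expected_cost(p_per_roll: float, rolls: float, impact: float) -> float:
    """Expected failures (C * probability) times impact per failure (B)."""
    return rolls * p_per_roll * impact

# One-off script: even a sloppy 10% bug only gets one shot at me.
one_off = p_any_failure(0.10, 1)

# A one-in-a-trillion bug that the world rolls a trillion times.
rare_but_hammered = p_any_failure(1e-12, 1e12)

print(f"one-off script:    {one_off:.3f}")
print(f"rare but hammered: {rare_but_hammered:.3f}")
```

Under those assumptions the "one in a trillion" bug ends up more likely to bite than the sloppy one-off script, simply because of how many times the dice get rolled.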

Why are device drivers hard? They are hard to write correctly enough to push the probability of bugs down, because they run in a very constrained yet highly privileged environment, and often one where it's very hard to test at all, let alone exhaustively test the space of all possible states the hardware may be in and all transitions it might like to make next. Device drivers can, in the worst case, potentially brick your device if they fail, but can certainly crash it or a major subsystem and cause massive data loss on the local system. And the real kicker is that a device driver is going to be exposed to the real world on millions upon millions of systems, potentially millions of times per second, for years and years on end. It's a situation where a sensible, sane programmer who might otherwise eschew all "that academic bullshit" might just find themselves reaching for formal verification tools.
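
As a back-of-the-envelope check on that exposure, here is a small Python sketch. The fleet size, event rate, and duration are illustrative assumptions (one million systems, one million events per second, one year), not measurements, and the one-in-a-trillion figure is just the example probability from above.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # ignoring leap years

def fleet_rolls(systems: int, events_per_second: int, years: int) -> int:
    """Total number of times the real world rolls the dice on the driver."""
    return systems * events_per_second * years * SECONDS_PER_YEAR

# Assumed numbers, purely for illustration.
rolls = fleet_rolls(systems=1_000_000, events_per_second=1_000_000, years=1)

# Expected manifestations of a bug with a one-in-a-trillion per-event chance.
expected_hits = rolls * 1e-12

# Per-event probability needed to expect less than one hit fleet-wide.
budget = 1 / rolls

print(f"rolls per year:         {rolls:.3e}")
print(f"expected hits per year: {expected_hits:.3e}")
print(f"per-event budget:       {budget:.3e}")
```

With those assumed numbers, a one-in-a-trillion bug would be expected to fire tens of millions of times a year across the fleet, which is why "rare" stops meaning much at this scale.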

Another example of something that is harder than you may think is "cloud code". Not code that uses the cloud, but the code that implements the cloud, like S3 or something. In principle a key-value store isn't that hard. But by the time you're done specifying that you want 99.9999999%+ reliability (not availability, reliability, i.e., even if the service sometimes goes down the data is still there when it comes back up), realizing that customers are putting crown jewels into your cloud server, and accounting for the trillions upon trillions of times that this service will be hit, the sheer expanse of complicated machinery executing it, and the number of failure modes in the entire system, from one bad bit of RAM up to correlated total datacenter failures, suddenly something that looks hardly more complicated than putting a Python dictionary up on the web becomes a massive, massive engineering challenge.