| This post reminded me of my time as a consulting systems support specialist. Lots of weird problems turned out to be bad hardware. Usually memory or disk, sometimes bad logic boards. For end users, this would often lead to complete freezing of the computer, so it was less likely to be blamed on broken software, but there were still many times it was hard to be sure. Desktop OS software can flake out in strange ways due to memory problems. I used to run a lot of memory tests as a matter of course. I think the title of the article could be more accurate, considering how much of it is devoted not to issues about software reliability per se, but to distinguishing between unreliable software and unreliable hardware. I think an implicit assumption in most discussions about software reliability is that the hardware has been verified. I personally do not think that it is the responsibility of a database to perform diagnostics on its host system, although I can sympathize with the pragmatic requirement. When I am determining the cause of a software failure or crash, the very first thing I always want to know is: is the problem reproducible? If not, the bug report is automatically classified as suspect. It's usually not feasible to investigate a failure that only happened once and cannot be reproduced. Ideally, the problem can be reproduced on two different machines. What we're always looking for when investigating a bug are ways to increase our confidence that we know the situation (or class of situation) in which the bug arises. And one way to do this is to eliminate as many variables as possible. As a support specialist trying to fix a faulty computer or program, I followed the same course: isolate the cause by a process of elimination. When everything else has been eliminated, whatever you are left with is the cause. I'm still all jonesed up for a good discussion about software reliability. 
antirez raised interesting questions about how to define software that is working properly or not. While I'm all for testing, there are ways to design and architect software that make it more or less amenable to testing, or more specifically, that make it easier or harder to provide full coverage. I've always been intrigued by the idea that the most reliable software programs are usually compilers. I believe that is because computer languages are amongst the most carefully specified kinds of program input, whereas so many computer programs accept very poorly specified kinds of input, like user interface actions mixed with text and network traffic, which are at higher risk of having ambiguous elements. (For all their complexity, compilers have it easier in some regards: they have a very specific job to do, and they only run briefly in batch operations, producing a single output from a single input. Any data mutations originate from within the compiler itself, not from the inputs it is processing.) In any case, I believe that the key to reliable programs is a complete and unambiguous definition of any and all data types used by those programs, as well as complete and unambiguous definitions of the legitimate mutations that can be made to those data types. If we can guarantee that only valid data is provided to an operation, and guarantee that each such operation produces only legitimate data, then we reduce the chances of corrupting our data. (Transactional memory is such an awesome thing. I only wish it were available in C family languages.) One of my crazy ideas is that all programs should have a "pure" kernel with a single interface, either a text or binary language interface, and this kernel is the only part that can access user data. Any other tool has to be built on top of this. So this would include any application built with a database back-end. 
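To make the "pure kernel" idea a bit more concrete, here is a minimal Python sketch of what I have in mind. Everything in it is invented for illustration: the names `Kernel` and `KernelError`, the `SET`/`INCR`/`GET` commands, and the choice of non-negative integer counters as the "user data" are assumptions, not taken from Redis or any real product. The point is only that the data is reachable through one text interface, and every input is validated before any mutation happens.

```python
class KernelError(Exception):
    """Raised when a command is malformed or would produce invalid data."""


class Kernel:
    """Hypothetical 'pure' kernel: the only code that touches user data."""

    def __init__(self):
        # name -> non-negative int; never handed out to callers directly.
        self._counters = {}

    def execute(self, command: str) -> str:
        """Single text interface: 'SET name value', 'INCR name', 'GET name'."""
        parts = command.split()
        if not parts:
            raise KernelError("empty command")
        op, args = parts[0].upper(), parts[1:]
        if op == "SET" and len(args) == 2:
            return self._set(args[0], args[1])
        if op == "INCR" and len(args) == 1:
            return self._set(args[0], str(self._get(args[0]) + 1))
        if op == "GET" and len(args) == 1:
            return str(self._get(args[0]))
        raise KernelError(f"unrecognised command: {command!r}")

    def _get(self, name: str) -> int:
        if name not in self._counters:
            raise KernelError(f"no such counter: {name!r}")
        return self._counters[name]

    def _set(self, name: str, raw: str) -> str:
        # Validate before mutating, so a rejected command leaves the data as it was.
        if not (raw.isascii() and raw.isdigit()):
            raise KernelError(f"value must be a non-negative integer: {raw!r}")
        self._counters[name] = int(raw)
        return "OK"
```

Any other tool (a GUI, a network front end, a test harness) would only ever call `execute`, so the set of legitimate mutations is exactly the set of commands the kernel accepts, and that set is small enough to test exhaustively.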
I suppose that a lot of Hacker News readers, being web developers, already work on products featuring such partitioning. But for desktop software developers who work with their own in-memory data structures and their own disk file formats, it's not so common or self-evident. Then again, even programs that do rely on a dedicated external data store also keep a lot of other kinds of data around, which may not be true user data, but can still be corrupted and cause either crashes or program misbehaviour. In any case, I suspect that this is going to be an inevitable side-effect of various security initiatives for desktop software, like Apple's XPC. The same techniques used to partition different parts of a program to restrict their access to different resources often lead to also partitioning operations on different kinds of data, including transient representations in the user interface. Can a program like Redis be further decomposed into layers to handle tasks focussed on different kinds of data to achieve even better operational isolation, and thereby make it easier to find and fix bugs? |
I don't think this is necessarily true; I used to maintain the Delphi compiler, and there were hundreds of bugs in the backlog that never really got looked at owing to workarounds, low impact and high cost of fixing.
What compilers usually have going for them is that they are batch processes rather than online processes, so they don't have time to build up crud in data structures; they have highly reproducible inputs - code that causes a crash normally causes a crash every run of the program, no weird mouse clicks or timing needed, and this code can usually be sent back to the vendor; and all customer code is effectively a unit test, so feedback from betas etc. is immediate and loud.