|
ECC memory can't eliminate the chances of these failures entirely. They can still happen. Making software resilient against bitflips in memory seems very difficult though, since it not only affects data, but also code. So in theory the behavior of software under random bit flips is well... Random. You probably would have to use multiple computers doing the same calculation and then take the answer from the quorum. I could imagine that doing so would still be cheaper than using ECC ram, at least around 2000. Generally this goes against software engineering principles. You don't try to eliminate the chances of failure and hope for the best. You need to create these failures constantly (within reasonable bounds) and make sure your software is able to handle them. Using ECC ram is the opposite. You just make it so unlikely to happen, that you will generally not encounter these errors at scale anymore, but nontheless they can still happen and now you will be completely unprepared to deal with them, since you chose to ignore this class of errors and move it under the rug. Another intersting side effect of quorum is that it also makes certain attacks more difficult to pull off, since now you have to make sure that a quorum of machines gives the same "wrong" answer for an attack to work. |
Using consumer hardware and making up reliability with redundancy and software was not a bad idea for early Google but it did end up with an unforeseen cost. Just a thousand machines in a cosmic ray proof bunker will end up with memory errors ECC will correct for free. It's just reducing the surface area of "potential problems".