| You are factually correct. However, availability isn't the problem ECC memory is meant to solve. The problem with memory errors is that they are silent. You won't notice them until something goes mysteriously wrong. And that can be anything, from an innocent invalid memory access to data corruption. This just can't be tolerated anywhere data is being processed, data you don't want to lose, that is... RAID does nothing if the OS thinks that its in-memory filesystem data structures are correct and just goes ahead and updates the superblock with bad data, or writes over other files' pages. You just get a nice, redundant, corrupted filesystem. The same goes for multiple machines sharing data anywhere, filesystems or databases alike. It's the error detection part that's important, not the correction part. And ECC main memory is just one part of the picture: you want to be notified of errors as soon as possible. And this is the important bit: "be notified". So you want parity checks and CRCs on disk caches and data buses and everywhere else it's feasible. It's not an accident that server-class hardware costs more than your average PC. The "correction" part is just a welcome by-product. I, for one, replace memory modules as soon as they trigger more than one ECC event. And this happens occasionally even with a universe of machines in the low dozens, with supposedly high-quality components. Now think what may be happening silently with all those borderline memory modules from anonymous manufacturers in China... Besides, like I mentioned before, it isn't easy to find servers with non-ECC memory from the usual vendors. Only their very low-end machines have it. Those are machines that aren't meant to do anything more than shove packets around, or serve other usage patterns where either silent data corruption can be tolerated (easy-to-replace appliances that don't process/store important data) or checksums are already part of the job (network stuff like firewalls or routers). |
I thought ECC events were triggered by the environment rather than by hardware faults? Or do you just figure some sticks are, by chance, more susceptible?
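
For anyone who wants to see the "be notified" point from the quoted comment in concrete terms, here is a rough software sketch. It is only an analogy: real ECC is done in hardware by the memory controller, not with `zlib`. The sample data and the helpers `flip_bit` and `parity` are made up for illustration. It shows a stored checksum catching a flipped bit that would otherwise go unnoticed, and why a single parity bit is weaker than a CRC.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted (a simulated memory error)."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def parity(data: bytes) -> int:
    """Single parity bit: 1 if the number of set bits in data is odd, else 0."""
    return sum(bin(byte).count("1") for byte in data) % 2


# Hypothetical block of "important" data, with checks recorded while it is good.
block = b"superblock: 4096 inodes, 1024 free"
stored_crc = zlib.crc32(block)
stored_parity = parity(block)

# One bit flips silently. Nothing crashes; the data is simply wrong.
damaged = flip_bit(block, 13)

# Both checks notice a single-bit error.
crc_detects = zlib.crc32(damaged) != stored_crc
parity_detects = parity(damaged) != stored_parity

# A second flipped bit cancels out in the parity bit, but not in the CRC.
double = flip_bit(damaged, 14)
parity_misses_double = parity(double) == stored_parity
crc_detects_double = zlib.crc32(double) != stored_crc

print("single flip: crc detects =", crc_detects, "| parity detects =", parity_detects)
print("double flip: crc detects =", crc_detects_double,
      "| parity misses =", parity_misses_double)
```

Neither check repairs anything here. They only tell you the data can no longer be trusted, which is the part the quoted comment argues matters most.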