|
|
|
|
|
by btilly
165 days ago
|
|
Thanks. It is nice to hear another perspective on this. But here is a question. How much of SEUs not being problems were because they weren't problems? Versus because there were solutions in place to mitigate the potential severity of that kind of problem? (The other problems that you name are harder to mitigate.) |
|
It's pretty challenging for software to defend against SEUs corrupting memory, especially when retrofitting an existing design like Linux. While operating Forge, we saw plenty of machines miscompute stuff, and we definitely worried about garbage getting into our caches. But my recollection is that the main cause was individual bad CPUs. We would reuse files in tmpfs for days without reverifying their checksums, and while we considered adding a scrubber, we never saw evidence that it would have caught much.
Maybe the CPU failures were actually due to radiation damage, but as they tended to be fairly sticky, my guess is something more like electromigration.