by tylerhou 1831 days ago
Nit: if you're actually running multiple CPUs to mitigate data corruption, you need a third CPU to break ties.
3 comments

Only if your response is to continue. If your response is to mark the processor pair as faulty and take it out of service, two is sufficient. I worked on the kernel for a fault-tolerant system in 1991 based on this model plus checkpointed memory. The sales story was that such a design was more cost-effective than the "pair and spare" approach used by competitors like Tandem and Stratus.
But you only need two to detect an issue, assuming they don't both fail the same way. It does become a problem on the mitigation side if you let the system limp along.
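To make the two-versus-three distinction concrete, here is a minimal sketch in Python. The function names `pair_check` and `majority_vote` are made up for illustration and do not come from any of the systems mentioned. It assumes each replica's result is a plain hashable value such as an integer.

```python
from collections import Counter


def pair_check(a, b):
    """Two replicas: a mismatch is detectable, but not which side is wrong."""
    if a == b:
        return a
    # With only two results there is no tie-breaker, so the only safe
    # response is to stop using this pair (fail-stop).
    raise RuntimeError("replica mismatch: take the pair out of service")


def majority_vote(a, b, c):
    """Three replicas: a single faulty result is outvoted by the other two."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count >= 2:
        return value
    # All three differ, so even a third replica cannot break the tie.
    raise RuntimeError("no majority: all three replicas disagree")
```

With two replicas the caller can only stop; with three it can keep going through a single fault, which is the trade-off the comments above are describing.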
What they do in IBM zSeries is compare the instruction executions on two cores. If they agree, execution carries on. If they disagree, they will retry the operation. If they disagree again, they will take the CPU (both cores) offline (and probably call home to IBM to bring a replacement).
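The compare, retry, then offline sequence can be sketched roughly as follows. This is only an illustration of the control flow described in the comment, not IBM's actual mechanism. The name `run_lockstep`, the `max_retries` parameter, and the idea of modelling each core as a callable are all assumptions.

```python
def run_lockstep(op, core_a, core_b, max_retries=1):
    """Run op on both cores and compare; retry on mismatch, then give up.

    core_a and core_b are hypothetical callables that execute op and
    return its result. Returns ("ok", result) when the cores agree, or
    ("offline", None) when they still disagree after the retries.
    """
    for _ in range(max_retries + 1):
        result_a = core_a(op)
        result_b = core_b(op)
        if result_a == result_b:
            return ("ok", result_a)
        # Mismatch: fall through and retry, in case the fault was transient.
    # Persistent disagreement: the caller should take both cores offline.
    return ("offline", None)
```

A transient fault on the first attempt clears on the retry and the result is accepted. A core that is wrong every time ends with the pair reported as offline.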