Hacker News
by dimtion 1831 days ago
In the Facebook bug listed in the post, this specific mitigation would probably not have been enough, since the bug was due to invalid instructions emitted by the JIT.

Under the exact same workload, two duplicate JITs running on two duplicate CPUs would most likely have emitted the same erroneous code.


> due to invalid instructions emitted by the JIT.

That's not how I understood the blog post.

> Next they needed to understand the specific sequence of instructions causing the corruption. This turned out to be as much of a nightmare as anything else in the story. The application, like most similar applications in hyperscale environments, ran in a virtual machine that used Just-In-Time compilation, rendering the exact instruction sequence inaccessible. They had to use multiple tools to figure out what the JIT compiler was doing to the source code, and then finally achieve an assembly language test:

>> The assembly code accurately reproducing the defect is reduced to a 60-line assembly level reproducer. We started with a 430K line reproducer and narrowed it down to 60 lines.

It sounds like the JIT produced correct (although hard to find) machine code. Then, when the CPU ran that machine code, it executed it incorrectly, but only on core 59.

Nit: if you're actually running multiple CPUs to mitigate data corruption, you need a third CPU to break ties.
Only if your response is to continue. If your response is to mark the processor pair as faulty and take it out of service, two is sufficient. I worked on the kernel for a fault-tolerant system in 1991 based on this model plus checkpointed memory. The sales story was that such a design was more cost-effective than the "pair and spare" approach used by competitors like Tandem and Stratus.
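
To make the two-versus-three distinction concrete, here is a minimal Python sketch. The function names `compare_pair` and `vote_three` are hypothetical, and results are assumed to be hashable values that can be compared for equality. Real fault-tolerant systems do this in hardware or in the kernel; this only illustrates the logic: a pair can detect a mismatch, while a triple can also mask a single bad result.

```python
from collections import Counter


def compare_pair(a, b):
    """Two replicas: a mismatch is detected, but we cannot tell which side is wrong."""
    if a == b:
        return ("ok", a)
    # No tie-breaker available: the only safe response is to flag the pair as faulty.
    return ("fault", None)


def vote_three(a, b, c):
    """Three replicas: a 2-of-3 majority masks one faulty result."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count >= 2:
        return ("ok", value)
    # All three disagree: no majority, so nothing can be trusted.
    return ("fault", None)
```

With a pair, `compare_pair(4, 5)` can only report a fault, which matches the "take the pair out of service" response. With three, `vote_three(4, 5, 4)` can keep going with the majority value.
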
But you only need two to detect an issue, assuming they don't fail the same way. Mitigation is another matter: it becomes a problem if you let the pair limp along.
What they do in IBM zSeries is compare the instruction executions on two cores to check that they agree. If they disagree, they will retry the operation. If they disagree again, they will take the CPU (both cores) offline (and probably call home to IBM to bring a replacement).
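
The compare, retry, then take-offline policy described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not IBM's implementation (which is done in hardware). The name `run_lockstep`, the `max_retries` parameter, and the idea of modeling each core as a plain callable are all assumptions made for the example.

```python
def run_lockstep(core_a, core_b, x, max_retries=1):
    """Run the same operation on two cores and compare the results.

    core_a and core_b are callables standing in for the two cores.
    On disagreement, retry up to max_retries times; if the cores still
    disagree, report that the pair should be taken offline.
    """
    for _ in range(max_retries + 1):
        result_a = core_a(x)
        result_b = core_b(x)
        if result_a == result_b:
            return ("ok", result_a)
    # Persistent disagreement: neither result can be trusted.
    return ("offline", None)
```

A transient glitch that disappears on the retry yields `("ok", ...)`, while a core that is consistently wrong yields `("offline", None)`. Note that this still cannot catch the case raised earlier in the thread, where both cores produce the same wrong answer.
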