by fefe23 702 days ago
So on one hand they are saying it's voltage (i.e. something external, not their fault, bad mainboard manufacturers!).

On the other hand they are saying they will fix it in microcode. How is that even possible?

Are they saying that their CPUs are signaling the mainboards to give them too much voltage?

Can someone make sense of this? It reminds me of Steve Jobs' You Are Holding It Wrong moment.

6 comments

Saying "elevated voltage causes damage" is not attributing blame to anyone. In the very next sentence, they then attribute the reason for that elevated voltage to their own microcode, and so it is responsible for the damage. I literally do not know how they could be any clearer on that.
> Are they saying that their CPUs are signaling the mainboards to give them too much voltage?

Yes that's exactly what they said.

So it's a 737 MAX problem: the software is running a control loop that doesn't have deflection limits. So it tells the stabilizer (or voltage reg in this case) to go hard nose down.
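To make the "no deflection limits" point concrete, here is a minimal sketch of the kind of limit that is missing in that picture: whatever the control loop computes, the request is clamped to a safe range before it reaches the regulator. The function name and both voltage limits are made up for illustration and are not Intel's actual values.

```python
# Hypothetical limits for illustration only; real parts have their own specs.
V_MIN = 0.70  # volts, assumed lower bound
V_MAX = 1.40  # volts, assumed upper bound


def clamp_request(requested_v: float, v_min: float = V_MIN, v_max: float = V_MAX) -> float:
    """Return the requested voltage limited to the range [v_min, v_max]."""
    return max(v_min, min(requested_v, v_max))


if __name__ == "__main__":
    # A runaway control loop asks for 1.60 V; the clamp caps it at V_MAX.
    print(clamp_request(1.60))
    # A request inside the range passes through unchanged.
    print(clamp_request(1.20))
```

A clamp like this does not fix a bad control loop, it only bounds how much damage a bad request can do.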
lol what a stretch of an analogy
The voltage supplied by the motherboard isn't supposed to be constant. The CPU is continuously varying the voltage it's requesting, based primarily on the highest frequency any of the CPU cores are trying to run at. The motherboard is supposed to know what the resistive losses are from the VRMs to the CPU socket, so that it can deliver the requested voltage at the CPU socket itself. There's room for either party to screw up: the CPU could ask for too much voltage in some scenarios, or the motherboard's voltage regulation could be poorly calibrated (or deliberately skewed by overclocking presets).

On top of all this mess: these products were part of Intel's repeated attempts to move the primary voltage rail (the one feeding the CPU cores) to use on-die voltage regulators (DLVR). They're present in silicon but unused. So it's not entirely surprising if the fallback plan of relying solely on external voltage regulation wasn't validated thoroughly enough.
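The resistive-loss compensation described above can be sketched with Ohm's law. This is a deliberately simplified model, not how any particular board implements it: the function names, the 100 A load, and the resistance figures are all assumptions chosen for the arithmetic, and real load-line schemes are more involved.

```python
def vrm_setpoint(requested_v: float, load_current_a: float, assumed_resistance_ohm: float) -> float:
    """Voltage the VRM should output so the socket sees requested_v,
    given the resistance the board *believes* lies between VRM and socket."""
    return requested_v + load_current_a * assumed_resistance_ohm


def socket_voltage(setpoint_v: float, load_current_a: float, actual_resistance_ohm: float) -> float:
    """Voltage that actually arrives at the socket after the real I*R drop."""
    return setpoint_v - load_current_a * actual_resistance_ohm


if __name__ == "__main__":
    requested = 1.30   # volts requested by the CPU (hypothetical)
    current = 100.0    # amps drawn under load (hypothetical)

    # Well-calibrated board: assumed and actual resistance match (1 milliohm).
    good = socket_voltage(vrm_setpoint(requested, current, 0.001), current, 0.001)

    # Miscalibrated board: it assumes 1 milliohm, but the real path is 0.5 milliohm,
    # so it over-compensates and the socket sees more than the CPU asked for.
    bad = socket_voltage(vrm_setpoint(requested, current, 0.001), current, 0.0005)

    print(f"calibrated: {good:.3f} V, miscalibrated: {bad:.3f} V")
```

In this model either side can cause overvoltage: the CPU by making `requested` too high, or the board by assuming more resistance than is really there.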

My guess is something like the following:

Modern CPUs are incredibly complex machines with a ridiculously large number of possible configuration states (too large to exhaustively test after manufacture or simulate during design), e.g. a vector multiply in flight with an AES encode in flight with x87 sincos, etc. Each operation is going to draw a certain amount of current. It is impractical to guarantee every functional unit its required current at all times, so the supply rails are sized for a "reasonable worst case".

Perhaps an underestimate was mistakenly made somewhere and not caught until recently. Therefore the fix might be to modify the instruction dispatcher (via microcode) to guarantee that certain instruction configurations cannot happen (e.g. let the x87 sincos stall until the vector multiply is done) to reduce pressure on the voltage regulator.
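As a toy illustration of that guess (and only the guess; nothing here reflects how a real dispatcher or microcode works), the sketch below issues operations cycle by cycle and stalls any operation that would push the total current draw over a budget. The operation names come from the example above; the current figures, the budget, and the one-cycle-per-operation model are invented for illustration.

```python
def schedule(ops, budget):
    """Greedy issue: each cycle, issue ops in order while the summed current
    stays within budget; ops that do not fit stall until a later cycle.

    ops is a list of (name, current) pairs in arbitrary units.
    Returns a list of cycles, each a list of the op names issued in that cycle.
    """
    pending = list(ops)
    cycles = []
    while pending:
        issued, stalled, used = [], [], 0
        for name, current in pending:
            if current > budget:
                # An op that can never fit would otherwise stall forever.
                raise ValueError(f"{name} alone exceeds the budget")
            if used + current <= budget:
                issued.append(name)
                used += current
            else:
                stalled.append((name, current))
        cycles.append(issued)
        pending = stalled
    return cycles


if __name__ == "__main__":
    # Hypothetical current draws in arbitrary units.
    ops = [("vector_mul", 6), ("aes", 3), ("x87_sincos", 4)]
    print(schedule(ops, budget=10))
    # → [['vector_mul', 'aes'], ['x87_sincos']]
```

With a budget of 10, the x87 sincos stalls for one cycle because issuing it alongside the other two would draw 13 units.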

It's worse than that: thermal management is part of the puzzle. Think of that as heat generation happening across three dimensions (X + Y + time) along with diffusion in 3D through the package.
It's an interesting idea, but there's a caveat: time flows in just one direction.
The claim seems to be that the microcode on the CPU is in certain circumstances requesting the wrong (presumably too high) voltage from the motherboard. If that is the case, fixing the microcode will solve the issue going forward but won’t help people whose chips have already been damaged by excessive voltage.
The “you’re holding it wrong!” angle is all your take. They don’t make that claim.
"OK, great, let’s give everybody a case" lives on