Hacker News
by starspangled 701 days ago
All modern CPUs come out of the factory with many, many bugs. The errata you see published are only the ones found after shipping (and only if you're lucky; they might not even publish all errata). Many more bugs are found and fixed in testing and qualification before shipping.

That's how CPU design goes. The way they cope is by pushing as much as possible into firmware, adding chicken switches and fallback paths, and building in all sorts of ways to intercept regular operation and replace it with a trap to microcode, a flush, or degraded operation.
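
To make the chicken switch idea concrete, here is a toy sketch in Python. Everything in it is made up for illustration (the Core class, the DISABLE_FAST_DIV flag, the divide bug); it is not how any real CPU exposes these controls, just the shape of the idea: a firmware-writable flag that routes around a buggy fast path to a slower, correct fallback.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    # Hypothetical "chicken bits": firmware-writable flags that turn off
    # an optimization. The names are invented for this sketch.
    chicken_bits: set = field(default_factory=set)

    def fast_divide(self, a: int, b: int) -> int:
        # Stand-in for a hardware fast path with a silicon bug that only
        # shows up for one operand pattern (the "erratum").
        if b == 3:
            return a // b + 1  # deliberately wrong
        return a // b

    def slow_divide(self, a: int, b: int) -> int:
        # Stand-in for a trap to microcode: slower, but correct.
        quotient = 0
        while a >= b:
            a -= b
            quotient += 1
        return quotient

    def divide(self, a: int, b: int) -> int:
        # Toy model only handles a >= 0 and b > 0.
        if a < 0 or b <= 0:
            raise ValueError("toy model expects a >= 0 and b > 0")
        if "DISABLE_FAST_DIV" in self.chicken_bits:
            return self.slow_divide(a, b)
        return self.fast_divide(a, b)


core = Core()
buggy = core.divide(9, 3)                   # hits the erratum
core.chicken_bits.add("DISABLE_FAST_DIV")   # the "firmware update"
fixed = core.divide(9, 3)                   # takes the fallback path
```

The fix costs performance (the fallback is a loop), which is the trade-off with real workarounds too.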

Applying fixes and workarounds might cost quite a bit of performance (think of the Spectre mitigations that disable some kinds of branch prediction, for an obvious and very big one). And in some cases you even see in published errata that they leave some theoretical correctness bugs unfixed entirely. Where is the line before accepting returns? Very blurry and unclear.

Almost certainly, huge parts of their voltage regulation (which goes along with frequency, thermal, and logic throttling) will be highly configurable. Quite likely it's run by entirely programmable microcontrollers on chip. Things that are baked into silicon might be voltage/droop sensors, temperature sensors, etc., and those could behave unexpectedly, although even then there might be redundancy or ways to compensate for small errors.

I don't see that they "passed it off" as a microcode issue, just said that a microcode patch could fix it. As you see, it's very hard from the outside to know if something can be reasonably fixed by microcode or to call it a "microcode issue". Most things can be fixed with firmware/microcode patches, by design. And many things are. For example, if some voltage sensor circuit on the chip behaved a bit differently than expected in the design but they could correct it by adding some offsets to a table, then the "issue" is that the silicon deviates from the model / design and that cannot be changed, but a firmware update would be a perfectly good fix, to the point they might never bother to redo the sensor even if they were doing a new spin of the masks.
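
As a rough illustration of the sensor example, here is a minimal Python sketch of correcting a raw sensor reading with an offset table. The table values, units, and the corrected_mv name are all assumptions invented for the sketch, not anything from Intel; the point is only that the silicon stays the same and the fix lives entirely in data that firmware can replace.

```python
import bisect

# Hypothetical correction table shipped in firmware:
# (raw sensor reading in mV, offset to add in mV). Values are made up.
OFFSET_TABLE = [(800, 0.0), (1000, 5.0), (1200, 12.0), (1400, 20.0)]


def corrected_mv(raw_mv, table=OFFSET_TABLE):
    """Apply a piecewise-linear offset to a raw sensor reading."""
    breakpoints = [x for x, _ in table]
    # Clamp to the end offsets outside the characterized range.
    if raw_mv <= breakpoints[0]:
        return raw_mv + table[0][1]
    if raw_mv >= breakpoints[-1]:
        return raw_mv + table[-1][1]
    # Find the surrounding breakpoints and interpolate the offset.
    i = bisect.bisect_right(breakpoints, raw_mv)
    (x0, off0), (x1, off1) = table[i - 1], table[i]
    t = (raw_mv - x0) / (x1 - x0)
    return raw_mv + off0 + t * (off1 - off0)


reading = corrected_mv(1100)  # halfway between the 1000 and 1200 entries
```

A firmware update in this picture is nothing more than shipping a different OFFSET_TABLE.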

On the voltage issue, they did not say it was requesting an out of spec voltage, they said it was incorrect. This is not necessarily detectable out of context. Dynamic voltage and frequency scaling and all the analog issues that go with it are fiendishly complicated. The voltage requested from a regulator is not what gets seen at any given component of the chip: loads, switching, capacitance, frequency, temperature, etc., can all conspire to change these things. And modern CPUs run as close to absolute minimum voltage/timing guard bands as possible to improve efficiency, and they boost up to as high a voltage as they can to increase performance. A small bug or error in some characterization data in this very complicated algorithm of many variables and large multi-dimensional tables could easily cause voltage/timing to go out of spec and cause instability. And it does not necessarily leave some nice log you can debug, because you can't measure voltage at all billion components in the chip on a continuous basis.
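
A heavily simplified Python sketch of how that can go wrong. All the numbers, table keys, and function names here are hypothetical, and a real algorithm has far more inputs; this only shows how one bad table entry or one bad droop estimate moves a request that was fine on paper outside the safe window, in either direction.

```python
# Hypothetical characterization data: minimum stable voltage (mV)
# indexed by (frequency in GHz, temperature bucket). Values are made up.
VMIN_MV = {
    (4.0, "cool"): 1000, (4.0, "hot"): 1040,
    (5.0, "cool"): 1180, (5.0, "hot"): 1230,
}
GUARD_BAND_MV = 30   # assumed safety margin above the minimum
VMAX_MV = 1300       # assumed upper limit for the request


def request_mv(freq_ghz, temp, expected_droop_mv, vmin_table):
    # Request enough to cover the minimum, the margin, and expected droop.
    return vmin_table[(freq_ghz, temp)] + GUARD_BAND_MV + expected_droop_mv


def check(request, actual_droop_mv, true_vmin_mv):
    # What a component sees is the request minus the real droop.
    at_component = request - actual_droop_mv
    if request > VMAX_MV:
        return "over-voltage"
    if at_component < true_vmin_mv:
        return "under-voltage"
    return "ok"


# Correct table, correct droop estimate: right at the limit but fine.
good = check(request_mv(5.0, "hot", 40, VMIN_MV), 40, 1230)

# One table entry 20 mV too high: the request exceeds the limit.
bad_table = dict(VMIN_MV)
bad_table[(5.0, "hot")] = 1250
too_high = check(request_mv(5.0, "hot", 40, bad_table), 40, 1230)

# Droop underestimated: the request looks fine, the component starves.
too_low = check(request_mv(5.0, "hot", 10, VMIN_MV), 60, 1230)
```

Note that in the last case the requested value itself looks perfectly reasonable, which is why "incorrect" is not the same as "out of spec" and why it may not show up in any log.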

And some bugs just take a while to find and fix. I'm not a tester per se, but I found a logic bug in a CPU (not Intel, but a commercial CPU) that was quickly reproducible and resulted in a very hard lockup of a unit in the core, and it still took weeks to find it. Imagine some ephemeral analog bug lurking in a dusty corner of their operating envelope.

Then you actually have to develop the fix, then you have to run that fix through quite a rigorous testing process and get reasonable confidence that it solves the problem, before you would even make this announcement to say you've solved it. Add N more weeks for that.

So, not to say a dishonest or bad motivation from Intel is out of the question. But it seems impossible to make such speculations from the information we have. This announcement would be quite believable to me.

2 comments

I agree with most of what you said, so cherry picking one thingy to reply to isn't my intention, but

"And some bugs just take a while to find and fix."

I think it's less that it took a while to find the bug, etc., and more that they've been pretty much radio silent for six months. When AMD had the issue with burning 7 series CPUs, they were quick to at least put out a statement that they'd make customers whole again.

Well, when it comes to Intel executive management and PR, I'm entirely unqualified to make any educated comment or speculation about it. I can't say I'm aware of Intel ever having great renown for its handling of product defects though.

Oh, I'm certainly the same, just some rando enjoying my popcorn.

> As you see it's very hard from the outside to know if something can be reasonably fixed by microcode or to call it a "microcode issue".

They claimed:

> a microcode algorithm resulting in incorrect voltage requests to the processor.

I was responding in the context of OP's theory that their statement may not be entirely truthful.

The thing is, "incorrect" implies the existence of a static "correct", which I interpret as a static spec that a microcode bug violated and that a simple microcode update could restore.

I do find your suggested scenario very plausible: that Intel has discovered their original voltage algorithm was flawed, leading to instability. And it is very feasible that simply updating the microcode is the correct fix for such an issue.

If Intel had explicitly stated that the original voltage algorithm spec was wrong, and the new one fixes the issue, I'd be pretty willing to believe them, and probably wouldn't have written that comment.

I'm not saying your interpretation of "incorrect voltage" as meaning "voltage that we now know causes instability" is wrong. It's an ambiguous statement and either interpretation is valid. But I have experience working with PR people, and they know how to avoid ambiguous statements.

PR people are also experts at using ambiguous statements to their advantage: crafting statements that not only have multiple possible interpretations, but that the average reader will tend to interpret in the best possible way. I have experience in helping PR people to craft such statements. There are a few other examples of "ambiguous statements" in that statement, which leads me to question the honesty of the whole thing.