Hacker News
by phire 702 days ago
I find it hard to believe that it actually is a microcode issue.

Mostly because Intel has way too much motivation to pass it off as a microcode issue, as they can fix a microcode issue for free, by pushing out a patch. If it's an actual hardware issue, then Intel will be forced to actually recall all the faulty CPUs, which could cost them billions.

The other reason is that it took them way too long to give details. If it's as simple as buggy microcode requesting an out-of-spec voltage from the motherboard, they should have been able to diagnose the problem extremely quickly and fix it in just a few weeks. They would have detected the issue as soon as they put voltage logging on the motherboard's VRM. And according to some sources, Intel have apparently been shipping non-faulty CPUs for months now (since April, from memory), and those don't have an updated microcode.

This long delay and silence feels like they spent months of R&D trying to create a workaround: a new voltage spec that provides the lowest voltage possible. Low enough to work around a hardware fault on as many units as possible, without too large a performance regression and without creating new errors on other CPUs because of undervolting.

I suspect that this microcode update will only "fix" the crashes for some CPUs. My prediction is that in another month Intel will claim there are actually two completely independent issues, and reluctantly issue a recall for anything not fixed by the microcode.

6 comments

As I understand it, there are multiple voltages inside the CPU, so just monitoring the motherboard VRM won't cut it.

That said, I too am very skeptical. I just issued a moratorium on the purchase of anything Intel 13th/14th gen in our company and am waiting for some actual proof that the issue is fully resolved.

It's complicated.

On Raptor Lake, there are a few integrated voltage regulators which provide new voltages for specialised uses (like the E core's L2 cache, parts of DDR memory IO, PCI-E IO), but the current draw on those regulators is pretty low. The bulk of the power comes directly from motherboard VRMs on one of several rails with no internal regulation. Most of the power draw is grouped onto just two rails: VccGT for the GPU, and VccCore (also known as VccIA in other generations), which powers all the P-cores, all the E-cores, the ring bus and the last-level cache.

Which means all cores share the same voltage, and it's trivial to monitor externally.

I guess it's possible the bug could be with only one of the integrated voltage regulators, but those seem to only power various IO devices, and I struggle to see how they could trigger this type of instability.

What's special about the E core's L2 cache such that it gets on-chip regulated voltage?

I suspect it's for one of the low power modes.

Keep in mind that the L2 cache is the last level cache for the E cores, and is shared by the entire cluster of four E cores. (One of the two clusters connects to the ring bus and shares the main L3, the other goes directly to main memory)

I'm guessing Intel can shut down VccCore entirely (which wipes every other cache), while keeping just enough voltage to maintain the E core L2 cache. By keeping valid data in L2, they can resume execution on an E core much quicker.

And as long as the reason for waking is a small periodic housekeeping task, they don't even need to wake up main memory. All the data fits in the 2MB of L2 cache. This makes resuming even faster and saves even more power. Finally, quick resumes allow the task to complete quicker and shut down VccCore again, which saves even more power.

This extreme level of power saving isn't really useful for desktops, but very useful for laptops and tablets. BTW, I'm not talking about a sleep mode here; the CPU will ideally be able to enter this mode any time there are no tasks to run for at least the next millisecond, so it can save power even when the user is actively using the system.
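To make the "no tasks for at least the next millisecond" idea concrete, here is a minimal sketch of how an idle-state choice of this kind could look. It is only an illustration of the guess above: the state names, the break-even times and the `pick_idle_state` helper are all hypothetical and are not taken from any Intel documentation.

```python
# Hypothetical idle states: (name, break-even idle time in microseconds).
# A deeper state only pays off if the CPU stays idle longer than its
# break-even time, because entering and leaving it costs time and energy.
# The 1000 us figure mirrors the "next millisecond" guess in the comment;
# the other number is invented.
IDLE_STATES = [
    ("clock-gated", 1),
    ("VccCore off, E-core L2 retained", 1000),
]


def pick_idle_state(predicted_idle_us):
    """Pick the deepest state whose break-even time fits the predicted idle.

    Returns None when the predicted idle period is too short to be worth
    entering any idle state.
    """
    chosen = None
    for name, break_even_us in IDLE_STATES:
        if predicted_idle_us >= break_even_us:
            chosen = name
    return chosen


if __name__ == "__main__":
    for idle_us in (0, 500, 1500):
        print(idle_us, "->", pick_idle_state(idle_us))
```

The point of the sketch is only that the decision depends on how long the scheduler expects to be idle, which is why such a mode can kick in even while the user is actively using the machine.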

It's most likely both a hardware issue and a microcode issue.

Making CPUs is kind of like sorting eggs. When they're made, they all have slightly different characteristics and get placed into bins (i.e., "binned") based on how they meet the specs.

To oversimplify, the *cough* "better" chips are sold at higher prices because they can run at higher clock speeds and/or handle higher voltages. If there's a speck of dust on the die, a feature gets turned off and the chip is sold for a lower price.

In this case, this is most likely an edge case that would not be a defect if shipping microcode already handled it. (Although it is appropriate to ask whether it would result in affected chips going into a lower-price bin.)

> If there's a speck of dust on the die, a feature gets turned off and the chip is sold for a lower price.

Do you mean that if a 13900KS CPU has a manufacturing defect, it gets downgraded and sold as 13900F or something else according to the nature of the defect?

It's way more extreme than that.

For any named product (such as Raptor Lake), Intel only makes 1-3 unique silicon dies. The Alder Lake family only had two dies, 8P+8E and 6P+0E [1]. Every single SKU comes from those two dies; if it has E cores, it's the 8P+8E die. Which means Alder Lake-N is actually the 8P+8E die with all the P cores disabled.

The laptop versions, Alder Lake-P (20 W) and Alder Lake-U (9 W and 15 W), are also the 8P+8E die. They couldn't use the 6P+0E die, because it has no E cores at all.

Raptor Lake is only one die with 8 P cores and 16 E cores, which they sell as every i9 and i7, along with the two top i5 designs. In the 13th generation, the remaining i5s are the Alder Lake 8P+8E die and the i3s are all Alder Lake 6P+0E dies.

The manufacturing defects aren't binary, it's not a simple pass/fail. It's all very analog: Some dies are simply able to reach higher clock speeds, or use more or less power. They test every single die and bin it based on its capabilities. The ones with the best power consumption go to the P and U SKUs. The ones which can reach the highest clock speeds are labeled as 13900KS, dies which just miss that get sold as 13900K, the rest get spread over all remaining SKUs based on their capabilities.

Intel couldn't decide to exclusively make 13900KS dies even if they wanted to, because those are simply the top 0.1% of dies. They are forced to make 1000 dies, use the best one and sell the rest as lower SKUs.
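A rough sketch of what that kind of binning looks like, for anyone who wants to see it as code. Everything here is an assumption for illustration: the clock thresholds, the normal distribution of die quality and the `assign_bin` / `bin_dies` helpers are made up, and real binning looks at far more than one number (power, leakage, defective cores, and so on).

```python
import random

# Hypothetical bins: (label, minimum stable clock in GHz), best bin first.
# The thresholds are invented and are not Intel's real criteria.
BINS = [
    ("13900KS", 6.0),
    ("13900K", 5.8),
]
FALLBACK = "lower SKU"


def assign_bin(max_stable_ghz):
    """Return the first (best) bin whose threshold the die meets."""
    for label, threshold in BINS:
        if max_stable_ghz >= threshold:
            return label
    return FALLBACK


def bin_dies(max_clocks):
    """Count how many dies land in each bin."""
    counts = {label: 0 for label, _ in BINS}
    counts[FALLBACK] = 0
    for ghz in max_clocks:
        counts[assign_bin(ghz)] += 1
    return counts


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumption: the max stable clock varies roughly normally between dies.
    dies = [rng.gauss(5.4, 0.2) for _ in range(1000)]
    print(bin_dies(dies))
```

With any distribution like this, the top bin is a thin slice of the tail, so the only way to get more top-bin parts is to make more dies overall and sell the rest as something else.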

[1] Wikichip has photos of the two dies: https://en.wikichip.org/wiki/intel/microarchitectures/alder_...

It's been almost 20 years since I worked in the industry, so I don't want to make assumptions about specific products.

When I was in the industry, it would be things like disabling caches, disabling cores, etc. I don't remember specific products, though.

Likewise, some dies can handle higher voltages, clock speeds, etc.

Yes. It’s called the silicon lottery.

Silicon lottery was when you as a customer could get dies of varying quality, some of which could be clocked higher than others. For the manufacturer it's not a lottery at all, because the scale makes the yields for the various bins mostly predictable. Binning also means that you as a customer are much less likely to get a chip that is significantly better than specced, although it still happens when chips are sold as a lower bin for market segmentation purposes.

The months of R&D to create a workaround could simply be because the subset of motherboards which trigger this issue is doing something borderline/unexpected with its voltage management, and finding a workaround for that behaviour in CPU microcode is non-trivial. Not all motherboard models appear to trigger the fault, which suggests that motherboard behaviour is at least a contributing factor to the problem.

I think this issue was sort of cracked open and popularized recently by this particular video from Level1Techs: https://www.youtube.com/watch?v=QzHcrbT5D_Y

Towards the middle of the video it brings up some very interesting evidence from online game server farms that use 13900 and 14900 variants for their high single-core performance for the cost, but with server-grade motherboards and chipsets that do not do any overclocking and would be considered "conservative". Yet these environments show a very high statistical failure rate for these particular CPU models. This suggests that a high percentage of the CPUs produced are affected, and that the problem develops over long run-time, not just on enthusiast/gamer motherboards pushing high power levels.

All modern CPUs come out of the factory with many many bugs. The errata you see published are only the ones that they find after shipping (if you're lucky, they might not even publish all errata). Many bugs are fixed in testing and qualification before shipping.

That's how CPU design goes. The way that is done is by pushing as much to firmware as possible, adding chicken switches and fallback paths, and all sorts of ways to intercept regular operation and replace it with some trap to microcode or flush or degraded operation.

Applying fixes and workarounds might cost quite a bit of performance (think Spectre and the disabling of some kinds of branch predictors, for an obvious very big one). And in some cases you even see in published errata that they leave some theoretical correctness bugs unfixed entirely. Where is the line before accepting returns? Very blurry and unclear.

Almost certainly, huge parts of their voltage regulation (which goes along with frequency, thermal, and logic throttling) will be highly configurable. Quite likely it's run by entirely programmable microcontrollers on chip. Things that are baked into silicon might be voltage/droop sensors, temperature sensors, etc., and those could behave unexpectedly, although even then there might be redundancy or ways to compensate for small errors.

I don't see that they "passed it off" as a microcode issue; they just said that a microcode patch could fix it. As you see, it's very hard from the outside to know if something can be reasonably fixed by microcode or to call it a "microcode issue". Most things can be fixed with firmware/microcode patches, by design. And many things are. For example, if some voltage sensor circuit on the chip behaved a bit differently than expected in the design but they could correct it by adding some offsets to a table, then the "issue" is that the silicon deviates from the model/design and that cannot be changed, but a firmware update would be a perfectly good fix, to the point they might never bother to redo the sensor even if they were doing a new spin of the masks.

On the voltage issue, they did not say it was requesting an out of spec voltage, they said it was incorrect. This is not necessarily detectable out of context. Dynamic voltage and frequency scaling and all the analog issues that go with it are fiendishly complicated, voltage requested from a regulator is not what gets seen at any given component of the chip, loads, switching, capacitance, frequency, temperature, etc., can all conspire to change these things. And modern CPUs run as close to absolute minimum voltage/timing guard bands as possible to improve efficiency, and they boost up to as high voltages as they can to increase performance. A small bug or error in some characterization data in this very complicated algorithm of many variables and large multi dimensional tables could easily cause voltage/timing to go out of spec and cause instability. And it does not necessarily leave some nice log you can debug because you can't measure voltage from all billion components in the chip on a continuous basis.
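A toy model of the "requested voltage is not what the die sees" point, in case it helps. This is a deliberately crude sketch under stated assumptions: the table values, the guard band, the load-line resistance, the safe limit and both helper functions are invented for illustration, and a real chip has many more variables (temperature, per-core behaviour, transients).

```python
# Hypothetical characterization table: frequency (GHz) -> minimum voltage (V)
# needed at the die for stable operation. All numbers are made up.
VMIN_TABLE = {4.0: 1.00, 5.0: 1.15, 5.5: 1.28, 6.0: 1.42}

GUARD_BAND_V = 0.03     # assumed safety margin on top of Vmin
LOADLINE_OHMS = 0.0011  # assumed effective load-line resistance (1.1 mOhm)
VMAX_SAFE = 1.55        # assumed long-term safe limit at the die


def requested_voltage(freq_ghz, expected_current_a):
    """Voltage to request so the die still sees Vmin + guard band after the
    load-line droop (I * R) at the current the algorithm expects."""
    droop = expected_current_a * LOADLINE_OHMS
    return VMIN_TABLE[freq_ghz] + GUARD_BAND_V + droop


def die_voltage(v_requested, actual_current_a):
    """Voltage actually seen at the die for the current really being drawn."""
    return v_requested - actual_current_a * LOADLINE_OHMS


if __name__ == "__main__":
    v_req = requested_voltage(6.0, expected_current_a=200)
    print("requested:", round(v_req, 3))
    print("die at 200 A:", round(die_voltage(v_req, 200), 3))
    print("die at 20 A:", round(die_voltage(v_req, 20), 3))
    print("over limit at 20 A:", die_voltage(v_req, 20) > VMAX_SAFE)
```

In this toy setup the same request is fine when the expected 200 A is actually drawn, but lands above the assumed limit when the real load is only 20 A. The request alone does not tell you whether it was "incorrect"; you need the context it was made in.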

And some bugs just take a while to find and fix. I'm not a tester per se but I found a logic bug in a CPU (not Intel but commercial CPU) that was quickly reproducible and resulted in a very hard lockup of a unit in the core, but it still took weeks to find it. Imagine some ephemeral analog bug lurking in a dusty corner of their operating envelope.

Then you actually have to develop the fix, then you have to run that fix through quite a rigorous testing process and get reasonable confidence that it solves the problem, before you would even make this announcement to say you've solved it. Add N more weeks for that.

So, not to say a dishonest or bad motivation from Intel is out of the question. But it seems impossible to make such speculations from the information we have. This announcement would be quite believable to me.

I agree with most of what you said, so cherry picking one thingy to reply to isn't my intention, but

"And some bugs just take a while to find and fix."

I think it's less that it took a while to find the bug, etc., and more that they've been pretty much radio silent for six months. AMD had the issue with burning 7 series CPUs, and they were quick to at least put out a statement that they'd make customers whole again.

Well, when it comes to Intel executive management and PR, I'm entirely unqualified to make any educated comment or speculation about it. I can't say I'm aware of Intel ever having great renown for its handling of product defects, though.

Oh, I'm certainly the same, just some rando enjoying my popcorn.

> As you see it's very hard from the outside to know if something can be reasonably fixed by microcode or to call it a "microcode issue"

They claimed:

> a microcode algorithm resulting in incorrect voltage requests to the processor.

I was responding in context of OP's theory that their statement may not be entirely truthful.

The thing is, "incorrect" implies the existence of a static "correct". Which I interpret as a static spec which a microcode bug violated and could be fixed back to that static spec with a simple microcode update.

I do find your suggested scenario to be very plausible. That Intel have discovered their original voltage algorithm was flawed, leading to instability. And it is very feasible that simply updating the microcode is the correct fix for such an issue.

If Intel had explicitly stated that the original voltage algorithm spec was wrong, and the new one fixes the issue, I'd be pretty willing to believe them, and probably wouldn't have written that comment.

I'm not saying your interpretation of "incorrect voltage" as meaning "voltage that we now know causes instability" is wrong. It's an ambiguous statement and either interpretation is valid. But I have experience working with PR people, and they know how to avoid ambiguous statements.

PR people are also experts at using ambiguous statements to their advantage. Crafting statements where not only are there multiple possible interpretations, but which the average reader will tend to interpret in the best possible way. I have experience in helping PR people to craft such statements. There are a few other examples of "ambiguous statements" in that statement, which leads me to question the honesty of the whole thing.

I believe that the waters may be muddied enough that they won't have to do a full recall, and will only replace a CPU if you 'provide evidence' the system is still crashing.

> I find it hard to believe that it actually is a microcode issue.

They learned a lot from the Pentium disaster. Even if it's a hardware issue, they can at least address it with microcode, which is just as good.

Except normally the result of a microcode workaround is that the chip no longer performs at its claimed/previously-measured level. Not "as good" by any standard.

For example, an Intel CPU with Spectre mitigations is not "as good" as a CPU that didn't have the vulnerability in the first place.

Microcode changes don't have to affect performance negatively. Do you have any evidence this one will? If it's a voltage algorithm failure, then I would expect that they could run it as advertised with corrected microcode. Unstable power is a massive issue for electronics like this and I have no problem believing their explanation. Bad power causes all sorts of weird issues.

If it was a microcode bug to begin with, fixing the bug wouldn't need to degrade performance. If it was e.g. a bad sensor, that you can "correct" well enough by postprocessing, it doesn't need to degrade performance. But if it's essentially incorrect binning -- the hardware can't function as they thought it would, use microcode to limit e.g. voltage to the range where it works right -- then that will degrade performance.

> If it was a microcode bug to begin with, fixing the bug wouldn't need to degrade performance.

This is both a completely untrue statement, and a judgement on a fix that hasn't been released yet.

At least with Spectre, applying the mitigation was a choice. You could turn it off and game at full speed, while turning it on for servers and web browsing for safety.

This is busted or working.