

by FaradayRotation 242 days ago

Caveats: My understanding of the Raptor Lake mess is pretty limited, mostly because Intel has been fairly tight-lipped about what specific issue caused it. My personal suspicion is that it was a Pareto plot's worth of issues. Also, while I do know a few things about this particular topic, I am far from the final authority on it.

My understanding is that localized resistive heating problems out in the wild tend to drive different failure modes than the global heating techniques used on the manufacturing line, mostly because the CPU is in active operation, which changes the defect physics. Put another way, any particular structure in the CPU likely would not need to reach 400 °C to fail: even the small voltages used in these chips, coupled with elevated temperature, can drive a lot of difficult-to-catch, slow-to-manifest failure modes. Copper metal migration is the classic example of this type of problem, where copper ions slowly migrate under voltage and temperature, creating and growing voids until finally an open circuit forms. Surprise! Your chip no longer works after seeming perfectly fine! Manufacturers try to catch such problems with simulated aging through aggressive temperature and voltage experiments. Intel must have discovered a big gap in their visibility, and then discovered that their CPU specs were incompatible with the stated product lifetime without a major re-spec of already-sold product. Ouch.
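
For a rough sense of how that kind of simulated aging is usually reasoned about, here is a minimal sketch using two textbook models: the Arrhenius temperature acceleration factor and Black's equation for metal-migration lifetime. This is not Intel's method, and nothing here comes from their data. The function names are hypothetical, and the activation energy and current-density exponent are inputs you would have to supply for a real process.

```python
import math

# Boltzmann constant in eV/K
BOLTZMANN_EV_PER_K = 8.617333262e-5


def arrhenius_acceleration(t_use_c, t_stress_c, ea_ev):
    """How many times faster a thermally activated failure mode
    progresses at the stress temperature than at the use temperature.

    ea_ev is the activation energy in eV (an assumed input; it depends
    on the failure mechanism and the process).
    """
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    exponent = (ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use_k - 1.0 / t_stress_k)
    return math.exp(exponent)


def black_mttf_ratio(j_use, j_stress, t_use_c, t_stress_c, ea_ev, n):
    """Ratio MTTF_use / MTTF_stress under Black's equation,
    MTTF = A * J**(-n) * exp(Ea / (k * T)).

    j_use and j_stress are current densities in the same (arbitrary)
    units; n is the current-density exponent (an assumed input).
    The prefactor A cancels in the ratio.
    """
    current_term = (j_stress / j_use) ** n
    return current_term * arrhenius_acceleration(t_use_c, t_stress_c, ea_ev)


if __name__ == "__main__":
    # Illustrative numbers only, not measured values.
    af = arrhenius_acceleration(t_use_c=60.0, t_stress_c=125.0, ea_ev=0.9)
    print(f"Temperature-only acceleration factor: {af:.1f}")
```

The point of the sketch is the exponential: a modest rise in temperature or current density shortens the modeled lifetime a lot, which is why a spec that looks fine in one operating regime can fall well short of the stated product lifetime in another.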

The chip manufacturer also has some capability to make repairs and adjustments before the end of the line, which should cover some of the issues you refer to. Some big customers might have their own repair capabilities.

Edit: Clarity, trying to better address the question