HN Mirror


by rygorous 842 days ago

(Oodle maintainer here.) This issue only occurs on some small fraction of machines, but on those that we've had access to, it reproduces with BIOS defaults and no user-specified overclocking. It turns out several of these mainboards will overclock and set other values out of spec even at BIOS defaults.

I don't have a problem with end users experiencing instability once they manually overclock (that's how it goes), but CPU + mainboard combinations experiencing typical OC symptoms with out-of-the-box settings is just not OK.

This appears to be an arms race between mainboard vendors, all going further and further past spec by default because it gives better benchmark and review scores and their competition does it. Intel, for their part, are also dialing in their own parts more aggressively over time (and presumably, although I don't know for sure, with smaller margins). They are for sure aware that this is happening: a) even had they not known already (which they did), they would have learned about it months ago when we first contacted them about this issue, and b) technically out of spec or not, as long as it seems to work fine for users and makes their parts look better in reviews, they're not going to complain.

However, it turns out, it does not work fine for at least some small fraction of machines. I have no idea what that percentage is, but it's high enough that googling for, say, "Intel 13900K crash" yields plenty of relevant results. Some of this will be actual intentional overclockers but, given how boards default to some extent of out-of-spec overclocking enabled, it's unlikely to be all of them.

Meanwhile we (and other SW vendors) are getting a noticeable uptick in crash reports on, specifically, recent K-series Intel CPUs, and it's not something we can sanely work around because the issue manifests as code randomly misbehaving, and not even when doing anything fancy. The Oodle issue in particular is during LZ77-family decompression, which is to say, all integer arithmetic (not even multiplies, just adds, shifts and logic ops), loads/stores and branches. This is the bare essentials. If it was an issue with, say, AVX2, we could avoid AVX2 code paths on that family of machines (and preferably figure out what exactly is going wrong so we can come up with a more targeted workaround). But there is no sane plan B for "integer ALU ops, loads/stores and branches don't work reliably under load". If we can't rely on that working, there is not enough left for us to work around bugs with!
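
To make "the bare essentials" concrete, here is a minimal sketch of an LZ77-style decoder in Python. It is not Oodle's code or format: the token layout and the names `lz77_decode` and `decodes_consistently` are made up for illustration. The point is only that the inner loop consists of masks, shifts, adds, byte loads/stores and branches, so there is no exotic instruction to route around.

```python
def lz77_decode(src: bytes) -> bytes:
    """Decode a toy LZ77 stream (hypothetical format, for illustration only).

    Control byte with high bit clear: literal run of (ctrl & 0x7F) + 1 bytes.
    Control byte with high bit set:   match of length ((ctrl >> 4) & 7) + 3
                                      at offset (((ctrl & 0xF) << 8) | next) + 1.
    """
    out = bytearray()
    i = 0
    n = len(src)
    while i < n:
        ctrl = src[i]
        i += 1
        if (ctrl & 0x80) == 0:
            # Literal run: copy bytes straight from the input.
            run = (ctrl & 0x7F) + 1
            if i + run > n:
                raise ValueError("truncated literal run")
            out += src[i:i + run]
            i += run
        else:
            # Match: copy bytes that were already written to the output.
            if i >= n:
                raise ValueError("truncated match token")
            length = ((ctrl >> 4) & 0x07) + 3
            offset = (((ctrl & 0x0F) << 8) | src[i]) + 1
            i += 1
            if offset > len(out):
                raise ValueError("match offset before start of output")
            start = len(out) - offset
            # Byte-by-byte so overlapping matches (offset < length) repeat correctly.
            for k in range(length):
                out.append(out[start + k])
    return bytes(out)


def decodes_consistently(src: bytes, rounds: int = 100) -> bool:
    """Decode the same input repeatedly and check every result is identical."""
    reference = lz77_decode(src)
    return all(lz77_decode(src) == reference for _ in range(rounds))
```

Decoding is deterministic, so a check like `decodes_consistently` should always pass on correctly working hardware; it is shown here only to illustrate what "code randomly misbehaving" would mean for a loop this simple.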

I realize this all looks like finger-pointing, but this is truly beyond our capacity to work around in a sane way in SW, with what we know so far anyway. Maybe there is a much more specific trigger involved that we could avoid, but if so, we haven't found it yet.

Either way, when it's easy to find end user machines that are crashing at stock settings, things have gone too far and Intel needs to sit down with their HW partners and get everyone (themselves included) to de-escalate.

2 comments

mips_r4300i 841 days ago

As a data point, I just built a new dev machine with a 14900K on a new ASUS board.

Out of the box with default settings, it was pushing 320W through the CPU in stress tests.

I use my machine for FPGA compiles so I need reliability. I learned that ASUS Multicore Enhancement is not the only thing that must be disabled; you must also manually enter the power limits.

Now my compiles take exactly the same length of time but use at least 100W less power.

I am glad to know that with your field data, I've inadvertently sidestepped a potentially catastrophic bug. I don't want to release an FPGA bitstream to users with flipped bits. And the FPGA tools already crash on their own enough.


rygorous 841 days ago

Yeah, pushing current CPUs (Intel and AMD both) as far as they'll go is well into diminishing returns. For AMD HW I'd likewise recommend using one of the "Eco" modes. The single-digit percentage points you get out of those last few hundred MHz really don't move the needle on productivity workloads, and the power draw reduction is substantial. It also makes the machines much quieter under load.


ajross 842 days ago

> I realize this all looks like finger-pointing

You think that might have something to do with you having put "Intel Processor Instability" in the title of a whitepaper on an issue that you already root caused to motherboard settings? I mean, did you want to troll a big flame war? Because this is how you troll a big flame war.


rygorous 842 days ago

It's an issue that, while the mainboard is involved, happens to occur on boards from (at least) the 3 best-selling mainboard vendors for that family of CPUs, at stock settings. You can take an affected CPU, swap it through a selection of the most popular mainboards compatible with it, and see the same kind of instability problems.

I don't think it's unreasonable to call that Intel's problem, maybe not in terms of culpability (but truly, nobody cares) but definitely in the sense that this is doing damage to their brand. If the mainboards are all out of spec then they need to talk about this publicly, rein them in, start a certification program, whatever. Being publicly completely fine with this as long as it results in good review scores but then starting to go "well actually..." when there are stability issues on a small fraction of sold units is not a good look.


ajross 840 days ago

> I don't think it's unreasonable to call that Intel's problem

You didn't call it Intel's problem. You said Intel CPUs were "unstable", which simply isn't true. If your title was "Intel doesn't police default BIOS clocking", we wouldn't be this far down in the senseless thread about semantics. (Though to be fair, you wouldn't have been on the front page as long either, so maybe that's as intended.)
