by ajross 847 days ago
Ferrari for sure warrants their cars as sold. But if you take it to a mod shop and put in an aftermarket turbo that damages your valves, you don't go whining to HN with an article with "Ferrari Engine Instability" in the title, do you?

I don't know what you want Intel to do here. They tell you upfront what the power and clock limits are on the parts. But the market has a three-decade history of people pushing the chips a little past their limit for fun and profit, so they "allow" it even if they know it won't work for everything.


(Oodle maintainer here.) This issue only occurs on some small fraction of machines, but on those that we've had access to, it reproduces with BIOS defaults and no user-specified overclocking. It turns out several of these mainboards will overclock and set other values out of spec even at BIOS defaults.

I don't have a problem with end users experiencing instability once they manually overclock (that's how it goes), but CPU + mainboard combinations experiencing typical OC symptoms with out-of-the-box settings is just not OK.

This appears to be an arms race between mainboard vendors all going further and further past spec by default because it gives better benchmark and review scores and their competition does it. Intel for their part are themselves also dialing in their parts more aggressively (and, presumably although I don't know for sure, with smaller margins) over time, and they are for sure aware that this is happening, because a) even had they not known already (which they did) they would have learned about this months ago when we first contacted them about this issue, b) technically out of spec or not, as long as it seems to work fine for users and makes their parts look better in reviews, they're not going to complain.

However, it turns out, it does not work fine for at least some small fraction of machines. I have no idea what that percentage is, but it's high enough that googling for, say, "Intel 13900K crash" yields plenty of relevant results. Some of this will be actual intentional overclockers but, given how boards default to some extent of out-of-spec overclocking enabled, it's unlikely to be all of them.

Meanwhile we (and other SW vendors) are getting a noticeable uptick in crash reports on, specifically, recent K-series Intel CPUs, and it's not something we can sanely work around because the issue manifests as code randomly misbehaving and it's not even when doing anything fancy. The Oodle issue in particular is during LZ77-family decompression, which is to say, all integer arithmetic (not even multiplies, just adds, shifts and logic ops), loads/stores and branches. This is the bare essentials. If it was an issue with say AVX2, we could avoid AVX2 code paths on that family of machines (and preferably figure out what exactly is going wrong so we can come up with a more targeted workaround). But there is no sane plan B for "integer ALU ops, load/stores and branches don't work reliably under load". If we can't rely on that working, there is not enough left for us to work around bugs with!
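
To illustrate the instruction mix described above, here is a minimal sketch of a toy LZ77-style decoder. The token format (high nibble = literal count, low nibble = match length, one-byte offset) and the name `lz77_decode` are hypothetical and chosen only for illustration; this is not Oodle's format or code. The point is that the whole loop consists of shifts, masks, adds, byte loads/stores and branches.

```python
def lz77_decode(src: bytes) -> bytes:
    """Decode a toy LZ77-style stream (hypothetical format, for illustration only).

    Each token byte holds a literal count (high 4 bits) and a match length
    (low 4 bits). The literals follow the token; if the match length is
    non-zero, a one-byte backward offset follows the literals.
    """
    out = bytearray()
    i = 0
    while i < len(src):
        token = src[i]
        i += 1
        lit_len = token >> 4        # shift
        match_len = token & 0x0F    # logic op

        # Literal run: copy bytes straight from input to output.
        if i + lit_len > len(src):
            raise ValueError("truncated literal run")
        out += src[i:i + lit_len]
        i += lit_len

        if match_len:               # branch
            if i >= len(src):
                raise ValueError("missing match offset")
            offset = src[i]
            i += 1
            if offset == 0 or offset > len(out):
                raise ValueError("offset points outside the output")
            start = len(out) - offset
            # Copy byte by byte so a match may overlap the bytes it produces.
            for k in range(match_len):
                out.append(out[start + k])
    return bytes(out)


# Token 0x36: 3 literals ("abc"), then a 6-byte match at offset 3.
stream = bytes([0x36]) + b"abc" + bytes([3])
decoded = lz77_decode(stream)
```

Nothing in this loop depends on vector extensions or other optional features, which is why there is no obvious alternative code path to fall back to if these operations misbehave.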

I realize this all looks like finger-pointing, but this is truly beyond our capacity to work around in a sane way in SW, with what we know so far anyway. Maybe there is a much more specific trigger involved that we could avoid, but if so, we haven't found it yet.

Either way, when it's easy to find end user machines that are crashing at stock settings, things have gone too far and Intel needs to sit down with their HW partners and get everyone (themselves included) to de-escalate.

As a data point, I just built a new dev machine with a 14900K on a new ASUS board.

Out of the box with default settings, it was pushing 320W through the CPU in stress tests.

I use my machine for FPGA compiles so I need reliability. I learned that ASUS Multicore Enhancement is not the only thing that must be disabled, you must manually enter the power limits.

Now my compiles take exactly the same length of time but use at least 100W less power.

I am glad to know that with your field data, I've inadvertently sidestepped a potentially catastrophic bug. I don't want to release an FPGA bitstream to users with flipped bits. And the FPGA tools already crash on their own enough.

Yeah, pushing current CPUs (Intel and AMD both) as far as they'll go is well into the diminishing returns. For AMD HW I'd likewise recommend using one of the "Eco" modes. The single-digit percentage points you get out of those last few hundred MHz really don't move the needle on productivity workloads, and the power draw reduction is substantial. It also makes the machines much quieter under load.

> I realize this all looks like finger-pointing

You think that might have something to do with you having put "Intel Processor Instability" in the title of a whitepaper on an issue that you already root caused to motherboard settings? I mean, did you want to troll a big flame war? Because this is how you troll a big flame war.

It's an issue that, while the mainboard is involved, happens to occur on boards from (at least) the 3 best-selling mainboard vendors compatible with that family of CPUs, at stock settings, so that you can take an affected CPU, swap it through a selection of the most popular mainboards compatible with said CPU and see the same kind of instability problems.

I don't think it's unreasonable to call that Intel's problem, maybe not in terms of culpability (but truly, nobody cares) but definitely in the sense this is doing damage to their brand. If the mainboards are all out of spec then they need to talk about this publicly, rein them in, start a certification program, whatever. Being publicly completely fine with this as long as it results in good review scores but then starting to go "well actually..." when there's stability issues on a small fraction of sold units is not a good look.

> I don't think it's unreasonable to call that Intel's problem

You didn't call it Intel's problem. You said Intel CPUs were "unstable", which simply isn't true. If your title was "Intel doesn't police default BIOS clocking", we wouldn't be this far down in the senseless thread about semantics. (Though to be fair, you wouldn't have been on the front page as long either, so maybe that's as intended.)

These motherboards are Intel-certified. If I get a mod shop to install a Ferrari-certified part, I expect the part to work.

Ferrari does not allow modifications of their cars. If you take it to a mod shop, they will void the warranty and you will be banned from purchasing a new Ferrari.

> Ferrari does not allow modifications of their cars

That's a blanket statement, and wrong. Ferrari doesn't allow unlicensed modifications of their cars.

You are able to customize many vehicles to your liking. And just like you can choose the options before sale, you're free to replace one official part with another official part after sale as well.

From what I can tell, this is limited to rims, tires, brakes, seats, passenger display, and other similar configuration options, though.

That's not what anyone means by modification. Of course you can choose a custom factory setup (which is purchasing a new car without modifying it) and swap an official part for another official part (usually called maintenance). This has nothing to do with what people usually mean by car modification, where you are absolutely free to change whatever you want, and do crazy stuff the manufacturer never imagined. They don't allow true modification by the normal meaning of the word for anyone who is interested in cars. They only allow you to choose from a limited selection of factory specs. And as you mentioned the options are minor, not the drivetrain itself. No real car enthusiast calls that modification.

Search for "3000hp lambo" on YouTube and you'll see what modification actually means.

You're right in every way, it just doesn't matter in this context.

We're talking about using one Intel-certified part with another Intel-certified part using Intel-certified default settings.

Meh. So you're in the "Intel should take affirmative action to prevent overclocking" camp. And as mentioned the response to that is that they've tried that (on multiple occasions, using multiple techniques) and people freaked out about that too. They can't win, I guess.

Absolutely not. There are two opinions that I hold that are relevant here:

- If I buy parts that are certified to work together, and I use them according to their respective manuals, they should work as specified.

- If I desire to manually change or customise something, I should be able to modify whatever I'd like.

- As soon as my changes go outside the certified range, I'm liable myself. But as long as I'm within the certified range, warranty should still apply and the product should continue working as specified.