by Ochi 848 days ago
So ideally, we should disable hyper threading to mitigate security issues and now also disable turbo mode to mitigate memory corruption issues. Maybe we should also disable C states to avoid side-channel attacks and disable efficiency cores to avoid scheduler issues... and at some point we are back to a feature set from 20+ years ago. :P
7 comments

It seems like problems occur from different firmware from the various motherboard manufacturers. I have a motherboard with a Ryzen 7950x and it would randomly not boot. I'd have to remove the battery from the system, let it fully reset, and then it would work again. Finally an update to the firmware fixed that bug.
Or just disable overclocking.
Intel is already running their CPUs at the red line. We're seeing the margin breaking down as Intel tries to remain competitive. The latest 14900KS can even pull > 400W. It's utter insanity.
I wish I kept up with them better. I swear every 3 months I see a headline that is "Intel says N nodes in {N-1 duration - 3 months}". I think I just saw 5 nodes in 4 years? And we've had 2 in the last 4? Sigh.
At least the built in “multicore enhancement” type overclocks that are popular nowadays with motherboard manufacturers.

I wonder if the old style “bump it up and memtest” type overclocking would catch this. Actually, what is the good testing tool nowadays? Does memtest check AVX frequencies?
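
On the testing question: one crude way to look for the kind of failure discussed in this thread (decompression that intermittently produces wrong output) is to repeat the same decompression many times and compare hashes. A minimal Python sketch, assuming the standard-library zlib as a stand-in workload; the name roundtrip_check is made up for illustration. It loads a single core and is no replacement for a real stress tool, so a clean run proves very little.

```python
import hashlib
import os
import zlib


def roundtrip_check(iterations: int = 200, size: int = 1 << 20) -> int:
    """Decompress the same buffer repeatedly; return how many runs went wrong."""
    # Half random (incompressible) and half zeros, so both literal and
    # match-copy paths in the decoder get exercised.
    data = os.urandom(size // 2) + bytes(size // 2)
    packed = zlib.compress(data, 6)
    expected = hashlib.sha256(data).digest()

    mismatches = 0
    for _ in range(iterations):
        try:
            out = zlib.decompress(packed)
        except zlib.error:
            # A corrupt-stream error on known-good input counts as a failure.
            mismatches += 1
            continue
        if hashlib.sha256(out).digest() != expected:
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    print("mismatches:", roundtrip_check())
```

Any nonzero count on known-good input would point at the machine rather than the data, since the input never changes between iterations.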

Intel Performance Maximizer is from Intel so I'd hope it has good tests.
But isn't overclocking the entire point of buying the K version of these chips?
Definitely not. These are supposed to be higher-quality bins that also ship with higher stock clock rates (both base and boost) and are rated for them.

I don't know how common this is across the whole population of PC buyers, but personally, I have for sure bought K-series parts then not clocked them past their stock settings, trusting that they are rated for it and deeply uninterested in any OCing past that. (I prefer my machines stable, thank you very much.)

Decades ago I visited a fellow Amiga user's house. He had an overclocked 68060 Apollo board.

He was so happy with the speed. Would not stop telling everyone, and talking about it. Yet as I watched him demo it, it rebooted every minute or so. Most unstable thing ever.

Sure it booted in 2 seconds, and he just went about his merry way, but.. what?! Guy could have still overclocked a little less and had stability, but nope.

Some overclockers are weird.

Yes it is but there's more to overclocking than just the CPU. You also need adequate cooling and fine-tuning of parameters I'll never truly understand. There are so many moving parts that you're not guaranteed anything. It seems like the CPUs were actually running at their overclocked speeds, but the rest of the system couldn't keep up.
Also might need to raise voltage etc
Why is this downvoted? That's exactly what's happening here. The affected devices are being overclocked, and the instructions at the end of the linked support document detail how to find the correct limits for your CPU and set them in your BIOS.
I think it is because overclocking has become so normal that it is an expected feature on most chips. Being told to disable it is like being told to disable the supercharger on your new Ferrari: you are no longer getting what you thought you had paid for.
If you put your foot to the floor in a supercharged car, you're going to eventually have to let it up lest you melt things or you burn all your oil because your rings aren't making contact with the cylinder walls any more. It's an apt metaphor since the same is true of CPUs. You can't run a CPU at 400 watts continuously for more than a handful of seconds at a time.

The problem is that Intel has normalized it so much that all their high end CPUs do this, and apparently do it often. It's not unexpected that they might be too close to the point where things are melting, so to speak.

I'd rather slower and more stable any day - I chose a Ryzen 7900 over a 7900X intentionally - but that isn't what all the marketing out there is trying to sell. The fancy motherboards, the water coolers, the highly clocked memory all account for lots of markup, so that's what's marketed. I'm not a fan.

It is worth noting a distinction between the terms "overclocking" and "turbo clocking". "Overclocking" has traditionally meant running the clock "over" the rating. "Turbo clocking" is now built in to almost every CPU out there. One technically can void your warranty, whereas the other doesn't.

Since we're mostly technical people here, we should use the appropriate term where the context makes that choice more accurate. It's like virus and Trojan - we SHOULD be technically correct, but that doesn't mean highly technical people aren't still calling Trojans viruses now and then.

Is that true?

I thought the entire premise of overclocking was that it's not officially supported and it may break things.

The whole point is that you're not paying for it and it's entirely at-risk.

Because if you do want a higher level of guaranteed performance, you do need to pay for a faster chip (if it exists).

CPU manufacturers certainly hold the line you stated, but motherboard vendors have jumped over the line and now sell motherboards that overclock for the end user entirely transparently.

It’s fair for the end user who bought a motherboard that promises a higher clock speed to expect that clock speed.

Do these motherboards explicitly provide a warranty that covers not just damage from overclocking but also CPU errors?

If you can provide links, I'd be curious to see what guarantees they make. "What's fair" depends very specifically on what language they use.

Nah, chips are manufactured as the best version and then damaged/fused/locked to downgrade them and create the cheaper versions (in the same series).

It's cheaper to have a single production line and then lock off features.

As crazy as it sounds, it actually costs a little bit more to produce the inferior version that is sold at a cheaper price.

Overclocking was a 'premium' feature due to the possibility of melting the chip. But nowadays the temp sensors cut power to prevent catastrophic failure.

Also worth mentioning: the downside to raising voltage is increased physical degradation of the cores, i.e. a lower lifespan for the CPU.

Well, Ferrari also tells people not to break speed limits. But if their cars started breaking apart at 85mph they would still be blamed. This might not be a warranty repair, and Intel is probably not legally liable, but this should have an impact on their reputation: Intel put out a chip that does not handle overclocking very well. Ok. I'll remember that when I am shopping for my next chip.
> The whole point is that you're not paying for it

Tell that to anyone who paid extra for a K-series Intel chip.

I don’t think there’s a great car analogy because the ecosystems and stakes are different.

These chips require motherboards to function, and these unlocked chips get their configuration from the motherboard. There’s no analogous entity to Ferrari the company here, it is like you bought an engine from one company, a gearbox from another, and the gearbox had a “responsiveness enhancement” setting that always redlined your RPMs or something (I don’t know cars).

Ferrari for sure warrants their cars as sold. But if you take it to a mod shop and put in an aftermarket turbo that damages your valves, you don't go whining to HN with an article with "Ferrari Engine Instability" in the title, do you?

I don't know what you want Intel to do here. They tell you upfront what the power and clock limits are on the parts. But the market has a three decade history of people pushing the chips a little past their limit for fun and profit, so they "allow" it even if they know it won't work for everything.

(Oodle maintainer here.) This issue only occurs on some small fraction of machines, but on those that we've had access to, it reproduces with BIOS defaults and no user-specified overclocking. It turns out several of these mainboards will overclock and set other values out of spec even at BIOS defaults.

I don't have a problem with end users experiencing instability once they manually overclock (that's how it goes), but CPU + mainboard combinations experiencing typical OC symptoms with out-of-the-box settings is just not OK.

This appears to be an arms race between mainboard vendors all going further and further past spec by default because it gives better benchmark and review scores and their competition does it. Intel for their part are themselves also dialing in their parts more aggressively (and, presumably although I don't know for sure, with smaller margins) over time, and they are for sure aware that this is happening, because a) even had they not known already (which they did) they would have learned about this months ago when we first contacted them about this issue, b) technically out of spec or not, as long as it seems to work fine for users and makes their parts look better in reviews, they're not going to complain.

However, it turns out, it does not work fine for at least some small fraction of machines. I have no idea what that percentage is, but it's high enough that googling for say "Intel 13900K crash" yields plenty of relevant results. Some of this will be actual intentional overclockers but, given how boards default to some extent of out-of-spec overclocking enabled, it's unlikely to be all of them.

Meanwhile we (and other SW vendors) are getting a noticeable uptick in crash reports on, specifically, recent K-series Intel CPUs, and it's not something we can sanely work around because the issue manifests as code randomly misbehaving and it's not even when doing anything fancy. The Oodle issue in particular is during LZ77-family decompression, which is to say, all integer arithmetic (not even multiplies, just adds, shifts and logic ops), loads/stores and branches. This is the bare essentials. If it was an issue with say AVX2, we could avoid AVX2 code paths on that family of machines (and preferably figure out what exactly is going wrong so we can come up with a more targeted workaround). But there is no sane plan B for "integer ALU ops, load/stores and branches don't work reliably under load". If we can't rely on that working, there is not enough left for us to work around bugs with!
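
For readers unfamiliar with what "LZ77-family decompression" involves, here is a minimal sketch of such a decoder. This is not Oodle's code or format: the byte layout (a control byte selecting a literal run or a back-reference) and the name lz_decode are invented purely for illustration. The point is that the inner loop needs nothing beyond integer adds, masks, loads/stores and branches.

```python
def lz_decode(src: bytes) -> bytes:
    """Decode a toy LZ77-style stream (hypothetical format, for illustration).

    Control byte c:
      high bit clear -> copy the next (c + 1) bytes verbatim (literal run)
      high bit set   -> copy ((c & 0x7F) + 3) bytes starting (next byte + 1)
                        positions back in the output (match)
    """
    out = bytearray()
    i = 0
    while i < len(src):
        ctrl = src[i]
        i += 1
        if (ctrl & 0x80) == 0:
            # Literal run: plain loads and stores.
            n = ctrl + 1
            if i + n > len(src):
                raise ValueError("truncated literal run")
            out += src[i:i + n]
            i += n
        else:
            # Match: re-copy bytes that were already written to the output.
            length = (ctrl & 0x7F) + 3
            if i >= len(src):
                raise ValueError("truncated match")
            offset = src[i] + 1
            i += 1
            if offset > len(out):
                raise ValueError("match offset before start of output")
            start = len(out) - offset
            # Byte-by-byte so a match may overlap the bytes it is producing.
            for k in range(length):
                out.append(out[start + k])
    return bytes(out)


if __name__ == "__main__":
    # Literal "abc", then a 6-byte match from 3 bytes back.
    print(lz_decode(bytes([0x02]) + b"abc" + bytes([0x83, 0x02])))
```

A single wrong add or load in the match path yields a wrong offset or a wrong byte, and every later match that refers back to it copies the damage forward.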

I realize this all looks like finger-pointing, but this is truly beyond our capacity to work around in a sane way in SW, with what we know so far anyway. Maybe there is a much more specific trigger involved that we could avoid, but if so, we haven't found it yet.

Either way, when it's easy to find end user machines that are crashing at stock settings, things have gone too far and Intel needs to sit down with their HW partners and get everyone (themselves included) to de-escalate.

As a data point, I just built a new dev machine with a 14900K on a new ASUS board.

Out of the box with default settings, it was pushing 320W through the CPU in stress tests.

I use my machine for FPGA compiles so I need reliability. I learned that ASUS Multicore Enhancement is not the only thing that must be disabled; you must also manually enter the power limits.

Now my compiles take exactly the same length of time but use at least 100W less power.

I am glad to know that with your field data, I've inadvertently sidestepped a potentially catastrophic bug. I don't want to release an FPGA bitstream to users with flipped bits. And the FPGA tools already crash on their own enough.
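
For anyone who wants to check what power limits are in effect without rebooting into the BIOS: on Linux, the powercap (intel_rapl) driver exposes the package limits under sysfs. A minimal sketch, assuming that driver is loaded and the package zone sits at the usual /sys/class/powercap/intel-rapl:0 path; the function name read_power_limits is made up, and the values the OS reports may not match the BIOS screen in every setup.

```python
from pathlib import Path

# Assumed default location of the package-0 RAPL zone on Linux.
DEFAULT_ZONE = Path("/sys/class/powercap/intel-rapl:0")


def read_power_limits(zone: Path = DEFAULT_ZONE) -> dict[str, float]:
    """Return {constraint name: limit in watts} for one powercap zone.

    Returns an empty dict if the zone does not exist or cannot be read.
    """
    limits: dict[str, float] = {}
    if not zone.is_dir():
        return limits
    for name_file in sorted(zone.glob("constraint_*_name")):
        # File names look like constraint_0_name, constraint_1_name, ...
        idx = name_file.name.split("_")[1]
        limit_file = zone / f"constraint_{idx}_power_limit_uw"
        try:
            name = name_file.read_text().strip()
            microwatts = int(limit_file.read_text().strip())
        except (OSError, ValueError):
            continue  # unreadable or malformed entry: skip it
        limits[name] = microwatts / 1_000_000
    return limits


if __name__ == "__main__":
    for name, watts in read_power_limits().items():
        print(f"{name}: {watts:.0f} W")
```

On a machine without that sysfs directory it simply prints nothing.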

> I realize this all looks like finger-pointing

You think that might have something to do with you having put "Intel Processor Instability" in the title of a whitepaper on an issue that you already root caused to motherboard settings? I mean, did you want to troll a big flame war? Because this is how you troll a big flame war.

These motherboards are Intel-certified. If I get a mod shop to install a Ferrari-certified part, I expect the part to work.
Ferrari does not allow modifications of their cars. If you take it to a mod shop, they will void the warranty and you will be banned from purchasing a new Ferrari.
Meh. So you're in the "Intel should take affirmative action to prevent overclocking" camp. And as mentioned the response to that is that they've tried that (on multiple occasions, using multiple techniques) and people freaked out about that too. They can't win, I guess.
I think "overclock" implies that the end-user is doing something that's out-of-spec for the thing they're operating.

This "I can run a core at a faster speed" is a documented feature so not really overclocking.

That's literally overclocking. You're clocking it at a rate over the nameplate value. Just because the BIOS is factory-unlocked doesn't really change anything.
If Intel sells a "3.2GHz CPU" and also advertises that it can run, thermals allowing, a core or two at 4.2GHz, I don't consider that 4.2GHz core "overclocked" as much as "this chip is engineered to have a variety of clocks as advertised." The chip is made from the factory to operate in a couple of different ways, just like my car may have a transmission that allows the engine to spin at a couple of different speeds, as the duty cycle demands.

If I run the chip in a way not documented by the manufacturer, or modify the ECU to allow the turbo to generate more boost, those are both unsupported modifications, and I'd consider either of those "overclocking"

Yeah. Intel advertises the ability to overclock, but that doesn't mean overclocking is in spec. It just means Intel allows you to run it out of spec if you so choose. The spec says you can set the clock multiplier, it doesn't say anything above the stock range will actually be stable.
Plus almost always people are tweaking voltages and such also.
Right, when they still knew how to make reliable hardware instead of cramming in features that aren't fully thought out and come with questionable tradeoffs to hit the bleeding edge.
TBH if we'd stopped at coppermine or tualatin and focused entirely on making the software better, it probably would be a better world.
Good, fast, cheap. Choose two.
haha, knew it wouldn't take long for the AMD fanboys to get winding up on how awful this is gonna be.

https://news.ycombinator.com/item?id=39479081

Somehow people think that it's a strawman, but people like parent comment actually think and post like this lol

IMO it is worth noting that the “turbo mode,” as you call it, seems to be an overclock that some motherboards do by default. Not the stock boost frequencies.

The hyperthread and c-state stuff, eh, if you want to run code that might be a virus you will have to limit your system. I dunno. It would be a shame if we lost the ability to ignore that advice. Most desktops are single-user after all.

Remember that you run a lot of untrusted code on your single-user desktop through Javascript on websites. Javascript can do all those side channel attacks like Spectre and Meltdown.
Maybe you do, but some of us use NoScript[0] and whitelist sites we trust.

I'm not affiliated with NoScript. I just think it's insane that we run oodles of code to display web pages.

[0] https://noscript.net/

Using no-script made me realize how unchained the Internet has become. Sites with upwards of 15 different domains all running whatever JS they want on your machine. Totally insane.
There are almost certainly unmitigated Spectre-style bugs hiding in modern hardware. People who don’t block JavaScript by default are impossible to protect anyway.
turbo boost is an advertised feature of the chip

these chips have been specially binned because they are supposedly stable at those frequencies (within an envelope set by intel)

if intel can't get it to work they shouldn't be selling these chips at all

Unless I misread the blog post, there doesn’t seem to be any issue with the stock turbo behavior.
Provided enough cooling, a chip that can boost to its turbo frequency for a few seconds should also run stably at that frequency indefinitely. Nowadays these boost clocks are so high that there is often not much gained by pushing any further.
> The hyperthread and c-state stuff, eh, if you want to run code that might be a virus you will have to limit your system.

So, you are trusting all web pages you view? Because these are unknown code running on your box which probably has some beefy private data.

I know some people browse the web while gaming, but I don't. For the gaming use case, I legit want a toggle that says "yes, all the code I'm running is trusted, now please prioritize maximum performance at all costs." For all I care this mode can cut the network connection since I don't do multiplayer.

I imagine people doing e.g. heavy number crunching might want something similar.

I run noscript and try to be selective about which pages I enable.
Intel should police their own ecosystem.
Why does this remind me of that big, extremely profitable company that makes something every American needs once in a while, and which seems to have abandoned all sanity in its processes? Looks like Intel and Boeing are on a similar path....
They have, in the past. People (including posters here) absolutely freaked out about clock-locked processors and screamed about the needless product differentiation of selling "K" CPUs at a premium.

People want to overclock. Gamers want to see big numbers. If gamers don't do it their motherboard vendors will. It's not a market over which Intel is going to have much control, really.

Note that you don't, in general, see this kind of silly edgelord clocking in the laptop segments.

Overclocking is ok.

Out of the box default overclocking is not, this aspect should be policed.

FWIW, there's no evidence that this is an "out of the box default" configuration on any of this hardware. Almost certainly these are users who clicked on the "Mega Super Optimizzzz!!!" button in their BIOS settings. And again, overclocking support on gaming motherboards is a feature that consumers want, and will pay for. So of course the vendors are going to provide it.
Oodle maintainer here, we had two people that hit the issue offer to run some experiments for us. Neither were doing any overclocking before and both tried numerous things including resetting to BIOS defaults and also updating their BIOS (there was a known [to Intel] issue affecting some ASUS boards that had been fixed in a BIOS update in spring of 2023, and we were asked to rule it out.)

This issue doesn't affect every such machine, but both people affected by the issue that consented to run tests for us still had the issue reproduce after flashing BIOS to current and with BIOS default settings for absolutely everything.

Among the settings enabled by default on some boards: current limit set to 511 amps (...wat), long duration power limit set to 350W (Intel spec: 125W), short duration power limit also set to 350W (Intel spec: 253W), "MultiCore Enhancement" which is extra clock boosting past what the CPUs do themselves set to "Auto" not "Off", and some others.
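
To put the power-limit numbers above side by side, a small Python sketch. It uses only the figures quoted in this comment (350W board defaults against Intel's 125W and 253W); the dictionary keys and the function name out_of_spec are made up for illustration, and the 511 amp current limit is left out because no spec value for it is given here.

```python
# Figures quoted in the comment above, in watts.
INTEL_SPEC_W = {"long_duration": 125, "short_duration": 253}
BOARD_DEFAULT_W = {"long_duration": 350, "short_duration": 350}


def out_of_spec(configured: dict[str, float], spec: dict[str, float]) -> dict[str, float]:
    """Return {limit name: configured / spec ratio} for limits above spec."""
    return {
        name: configured[name] / spec[name]
        for name in spec
        if name in configured and configured[name] > spec[name]
    }


if __name__ == "__main__":
    for name, ratio in out_of_spec(BOARD_DEFAULT_W, INTEL_SPEC_W).items():
        print(f"{name}: {ratio:.2f}x the spec value")
```

With these inputs the long duration limit comes out at 2.8 times the spec value and the short duration limit at roughly 1.38 times.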