by magila 1813 days ago
A key problem with this proposal is that modern CPUs do not have a single definitive maximum frequency. You have the base frequency, which is rarely relevant outside of synthetic workloads, and then you have a variety of turbo frequencies which interact in complex ways. AMD's latest CPUs don't even have clear upper bounds on their frequency scaling logic. It's a big black box and the results can vary depending on the workload, temperature, silicon quality, phase of the moon, etc.
6 comments

CPU usage is an incredibly complex metric that doesn't really explain what is actually going on. I noticed while running a benchmark I can get my CPU to 50c and 100% usage and it stays steady like that. But then I tried prime95 and my cpu very quickly hit 99c also at 100%. Likely both benchmarks were running as fast as they could, but prime95 exercised a part of the CPU that generates more heat and did not get stuck waiting on memory or other slow ops.
CPU usage calculation can be (mainly) broken down into "percentage time spent not idle due to program" vs "percentage of max performance used by program". The former is usually what an end user cares about and measures the amount a program is hogging the system. The latter is usually what a programmer wants to know to see how optimized the program is.

Both are extremely complex, with a lot of nuance beyond that, but the "percentage time spent not idle due to program" type this article refers to tends to be simpler. The second type requires figuring out cache behavior, mode switches, instruction-level parallelism, instructions per clock per instruction type, and so on, on top of everything you need to figure out in the first case anyway.
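
For the first definition, here is a rough sketch of how a "time not idle" percentage can be derived from two samples of cumulative tick counters, in the style of the aggregate `cpu` line in Linux's /proc/stat. The function name and the sample numbers are made up for illustration; only the field order (user, nice, system, idle, iowait, irq, softirq, steal) follows /proc/stat, and treating iowait as idle is an assumption.

```python
def busy_percent(before, after):
    """Percent of elapsed ticks spent not idle between two counter samples.

    Each sample is a sequence of cumulative tick counts in /proc/stat field
    order: user, nice, system, idle, iowait, irq, softirq, steal.
    """
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    idle = deltas[3] + deltas[4]  # assumption: idle + iowait count as "not busy"
    return 100.0 * (total - idle) / total


# Made-up samples taken one interval apart.
sample_1 = (1000, 0, 500, 8000, 100, 0, 0, 0)
sample_2 = (1300, 0, 600, 8500, 200, 0, 0, 0)
print(busy_percent(sample_1, sample_2))  # 40.0
```

Note that this says nothing about how fast the CPU was clocked or how much useful work was done during the busy ticks, which is the second definition's problem.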

Sounds like you're actually comparing your CPU in AVX mode vs. not. This is a common source of this type of behavior; AVX is insanely power hungry.
I've found that CPU temperature is an extremely useful way to measure load on the system. All of my computers mine cryptocurrency in the background. I used cgroups to limit their CPU usage, tuning the parameters until the CPU temperature reached acceptable levels. Sometimes I also observe huge spikes in temperature when I do random things. Scrolling certain javascript-heavy pages causes my temperature readings to briefly spike all the way up to 90 °C!
I think you could prove that on Linux by getting the number of instructions run. I think perf has a way of getting that info for you. Not sure if Windows has something similar.
Hmm, different workloads will have different mixes of instructions, which will take very different amounts of time per instruction (e.g. a chain of uncached memory lookups vs a tight loop of register-register arithmetic). If you had two instances of the same workload, and one was throttled, then yes, you could compare instructions per unit time. But comparing prime95 vs the other benchmark is likely misleading.
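
As a sketch of that same-workload comparison: given an instruction count and wall time for two runs (for example, numbers read off a tool like `perf stat`), the throughput ratio gives a rough idea of how much one run was slowed down. The function names and the figures are hypothetical.

```python
def instructions_per_second(instructions, seconds):
    """Retired instructions per second of wall time for one run."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return instructions / seconds


def relative_throughput(run, reference):
    """Throughput of `run` as a fraction of `reference`.

    Only meaningful when both runs execute the same workload, so that the
    instruction mix is the same.
    """
    return instructions_per_second(*run) / instructions_per_second(*reference)


# Hypothetical (instructions, seconds) pairs for the same workload.
unthrottled = (40_000_000_000, 10.0)
throttled = (40_000_000_000, 16.0)
print(relative_throughput(throttled, unthrottled))  # 0.625
```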
Yeah, that would be the way to prove that prime95 is going through more instructions, so the CPU runs hotter.
I see. But I expect that some instructions cost more (i.e. generate more heat) than others. Someone mentions AVX instructions, for example; presumably 1 million AVX-512 fused-multiply-add instructions cost a lot more than 1 million loads (we can probably arrange for the right proportion of loads to hit a CPU cache vs. go to main memory, so that they end up taking the same time). Even without AVX, I imagine things like 64-bit integer divisions or CRC32 instructions cost more than loads or stores, though I don't know by how much.
oh that makes sense, i guess that would be another aspect to measure
> my cpu very quickly hit 99c also at 100%

That should never happen and it's ridiculous that we just let manufacturers get away with it.

Why? 100c is typically the max nominal safe operating temp for CPUs. It would be a waste of resources to add additional cooling to computers not intended to run these types of workloads.

Prime95 is basically a synthetic workload so it makes little sense to optimize for it.

Yeah, the thing is it doesn't only happen in prime95. Nowadays it's any prolonged use where the CPU is fully used, like video editing or gaming. Give an inch and they'll take a mile, as the saying goes.

Temperature junction throttling is a last resort. No laptop should rely on it in normal operation. Of course, both HP/Dell/Lenovo/etc and Intel benefit from increased sales so they don't care.

Counterpoint - during typical (consumer) usage hardware spends most of the time idle. Hardware capable of sustaining the maximum workload indefinitely is likely to have a lower maximum in practice. Unsustainable bursts are likely to provide higher overall performance for typical workloads, so it makes sense to optimize the hardware design for those.
I guess, but workstation class laptops still overheat, so again, piss poor design.
How did you determine that 100c is safe?
The manufacturer did and put it in the datasheet of the CPU
To add to what the other commenter said, the critical temperature that is listed is usually higher than the maximum operating temperature, so it is safe to run at the maximum.
It is safe only because it throttles heavily. Shutdown temperature is just 5 degrees higher btw.

At 100 the processor is at high risk of damage, which is why there's a built-in throttling mechanism.

Just imo, if it was safe it would be running at full speed (or at least max base clock) at that temperature.

As someone else said, they're made for burst operation these days, but again, that does not excuse manufacturers using subpar cooling.

I can see the majority is fine with it, but I'm not. A 15-25% failure rate in 2 years would make any other product a rotten lemon. But somehow it's acceptable for computers. Probably because people replace them every 2 years regardless, which is another insanity on its own.

I was testing overclocks when this happened. This was also a very old intel chip.
By this proposal's logic, it would make total sense for processes on a boosting CPU core to report more than 100% cpu usage.
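
One way to read that: scale the busy fraction by the ratio of the current clock to the base clock. A minimal sketch, with a hypothetical function name and made-up frequencies (this is not how any particular OS reports usage):

```python
def frequency_scaled_usage(busy_fraction, current_mhz, base_mhz):
    """Busy time weighted by the current clock relative to base, in percent."""
    if not 0.0 <= busy_fraction <= 1.0:
        raise ValueError("busy_fraction must be between 0 and 1")
    if base_mhz <= 0:
        raise ValueError("base_mhz must be positive")
    return 100.0 * busy_fraction * current_mhz / base_mhz


# A fully busy core boosting from a 3600 MHz base to 4500 MHz.
print(frequency_scaled_usage(1.0, 4500, 3600))  # 125.0
# The same core, fully busy but throttled to 2700 MHz.
print(frequency_scaled_usage(1.0, 2700, 3600))  # 75.0
```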
“Military power” seems to fit perfectly…

> The etiology of military power is from War Emergency Power (WEP) which in the WWII era was a higher than normal rating power (i.e. >100% rated power) setting on an aircraft engine. Such power settings were approved for short durations (typically 5 minutes or less) such as takeoff and battle maneuvers.

> The term was quickly shortened to military power.

Please let's not add yet more military fetishizing to IT. "Military power" should refer to things like military engines, not totally unrelated electronic performance boost modes.
The IT overlap is mostly because computers and the internet are military technology. Mainframes, ICs and the transistor itself were bankrolled by US military money.

The internet and GPS, the core technologies of our time, are both direct descendants of the US military.

You may not like it, but the entire IT and tech field are built on US military tech

Civilian and military engineering history are intertwined. The tech transfer doesn't go one way only.

In ancient times both mercantile and military incentives propelled the design of ships. Is it fair to say the whole shipbuilding field is built on military tech?

You're just picking a point in time and calling that the beginning; every one of those things is based on previous tech with no military background.

Even if it were true that in the distant past a thing had a military application/funding, there's no reason to use their terms for new things which aren't military in nature. No commercial liner calls its full speed "military power". It's just cosplay at this point.

If you go far enough back all of our technology arises from fire and the wheel.

There's plenty of room to debate how much of our technology came from the military, but I don't think naming is a big deal.

Take the word "screen" for example. The original meaning was a partition to protect from heat. Many of these were fabric, and led to "magic lantern" shows done with shadows. The word was repurposed again for the projection era. And yet again for tube TV's and beyond.

If you told somebody from 100 years ago to look at the screen they would have no idea what you're talking about because the original meaning is lost.

That's why I don't think it matters whether we use military terms for computer stuff. It doesn't have its original meaning anymore.

"Overdrive"
Turbo. I miss the old turbo button, even when it didn't actually do much of anything.
Turbo actually slowed the CPU to 4.77 MHz (the speed of the 8086/8088) because early programs' timing logic was based on a 4.77 MHz clock frequency.
This used to be the case -- XP era had it report as a percentage of the target frequency, so an Athlon XP would usually run at 110%. This is confusing to people who believe 100% is the maximum.
"these go to eleven!"

https://youtu.be/hW008FcKr3Q

Linux control groups and systemd CPU quotas use values larger than 100% for denoting more than one CPU.

https://www.freedesktop.org/software/systemd/man/systemd.res...

> The percentage specifies how much CPU time the unit shall get at maximum, relative to the total CPU time available on one CPU. Use values > 100% for allotting CPU time on more than one CPU.
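
To make the arithmetic concrete, here is a sketch of how a percentage like this translates into a quota and period pair, the form the cgroup v2 `cpu.max` file uses (both values in microseconds). It assumes a 100 ms period, which is the systemd default for `CPUQuotaPeriodSec=`; the helper name is hypothetical and this is not systemd's actual code.

```python
def quota_to_cpu_max(quota_percent, period_us=100_000):
    """Return a 'QUOTA PERIOD' string for a quota given in percent of one CPU."""
    if quota_percent <= 0:
        raise ValueError("quota_percent must be positive")
    # 100% means the full period on one CPU; 250% means 2.5 CPUs' worth.
    quota_us = round(period_us * quota_percent / 100)
    return f"{quota_us} {period_us}"


print(quota_to_cpu_max(20))   # 20000 100000
print(quota_to_cpu_max(250))  # 250000 100000
```

So a value above 100% is not "more than maximum"; it just means the quota per period is longer than the period itself, which only makes sense when it is spread across several CPUs.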

Absolutely, if this really matters to you I think the only solution is to characterise the CPU over an afternoon or two and build a model based on the data rather than what the Manufacturer told you.
The OS can keep a record over time of the maximum frequency each CPU core has ever hit.

This will take into account machine-to-machine variance, and even environmental factors affecting maximum speed.
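
A minimal sketch of that bookkeeping, assuming something else supplies per-core frequency samples in MHz. The class and method names are hypothetical, and the sample values are made up.

```python
class PeakFrequencyTracker:
    """Remember the highest frequency observed so far on each core."""

    def __init__(self):
        self.peak_mhz = {}

    def observe(self, core, mhz):
        """Record a sample; return it as a percent of that core's peak so far."""
        peak = max(self.peak_mhz.get(core, 0.0), mhz)
        self.peak_mhz[core] = peak
        return 100.0 * mhz / peak if peak > 0 else 0.0


tracker = PeakFrequencyTracker()
# Made-up samples for core 0: a short boost, then lower sustained clocks.
for mhz in (4800, 4400, 3900):
    print(tracker.observe(0, mhz))
```

With this kind of normalization, any sample below the recorded peak reads as less than 100%, even if the core is running as fast as it currently can.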

Oh, but it gets more fun. The same operation can take more clocks for various reasons... If I really undervolt my Zen 2 APU, power usage and benchmarks go way down, but clock frequency stays high; the CPU is clock stretching and it gets a lot less work done.

Anyway, current processors run a separate clock per core, and maximum clocks are only available when a small number of cores are active; if all cores are busy, that should really be 100%, even if each core is only doing 80% of max for a single core.

Mostly, I want to see % of time cpu is busy, and separately, stats on how throttled the cpu is, because it's hard to combine both into a coherent number. Maybe also some idea of how much of the core is being exercised, if it can be easily measured... I'd love to know when a program is keeping the cpu busy, but not making good use of it.

That strategy guarantees that a process that runs for multiple minutes while consuming all available CPU cycles will be reported as using 100% CPU at most during the first few seconds, after which it will usually be reported as using somewhere less than 90%, and realistically could be reported as low as 65%. How is this helpful?
Estimating x86 core frequency is a lot trickier than you've implied.
Ideally, in the article’s example, the program generating heat and causing the throttle would have the throttled time counted against it (as a separate metric).

That’s hard to do without hardware support, but I wonder how hard it would be to get a decent metric with currently available performance counters.

Not only that, but modern CPUs aren't even deterministic in how they run an instruction. The result of any particular instruction is deterministic, but which part of the silicon, and how much of it, is used to complete the instruction is not deterministic anymore.

CPU Usage measured in percentage doesn't make sense on any CPU with modern performance features like speculative execution and branch prediction.