

by tarlinian 2132 days ago

This isn't the real problem though. The problem with the existing AVX-512 implementations is that the "relaxation time" causes subsequent scalar code to be slow.

From later in the same post:

> Here, we have the worst case scenario of transitions packed as closely as possible, but we lose only ~20 μs (for 2 transitions) out of 760 μs, less than a 3% impact. The impact of running at the lower frequency is much higher: 2.8 vs 3.2 GHz: a 12.5% impact in the case that the lowered frequency was not useful (i.e., because the wide SIMD payload represents a vanishingly small part of the total work).
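To make the two percentages in that quote concrete, here is a minimal sketch of the arithmetic. The function names are made up for illustration; the inputs (20 μs of halts per 760 μs, 2.8 vs 3.2 GHz) are the figures from the quoted post.

```python
def halt_overhead(halt_us: float, period_us: float) -> float:
    """Fraction of wall time lost to frequency-transition halts."""
    return halt_us / period_us


def frequency_penalty(low_ghz: float, high_ghz: float) -> float:
    """Fraction of clock rate lost by running at low_ghz instead of high_ghz."""
    return (high_ghz - low_ghz) / high_ghz


# Figures taken from the quoted post.
halt = halt_overhead(20.0, 760.0)      # two transitions, ~20 us total
freq = frequency_penalty(2.8, 3.2)     # downclocked vs nominal frequency

print(f"halt overhead:     {halt:.1%}")
print(f"frequency penalty: {freq:.1%}")
```

Under these numbers the halts cost about 2.6% while the lower clock costs 12.5%, which is why the lowered frequency dominates.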

Interestingly enough, this is another feature that is supposed to have been improved on server Ice Lake. The frequency transition halt time is now pretty much negligible. The "core frequency transition block time" goes from ~12 μs on CLX (Cascade Lake, similar to the number quoted above) to ~0 μs on ICX (Ice Lake server).

(Slide with frequency transition info: https://images.anandtech.com/doci/15984/202008171754441.jpg)


Dylan16807 2132 days ago

Either way, it's a big problem that certain single instructions can cause this transition. When the transition is based on a usage threshold of heavy instructions, it's not so bad. And with this revision, more of the transitions are based on threshold. But there are still some instructions that cause an immediate frequency change, if I'm reading the articles right.


tarlinian 2131 days ago

No, the whole point is that the halt induced by a single instruction for downclocking isn't a real issue. Even in the pathological case where you insert a single instruction every 760 μs in order to induce the maximum number of clock shifts, the total performance degradation due to the clock halts is less than 3% (the frequency drop induced by the use of these instructions has a much larger impact). Furthermore, on Ice Lake-SP, the halted time due to frequency transitions is supposed to go to 0, which makes this aspect of the problem entirely irrelevant.

Yes, if you insert a single 512-bit FMA that runs every so often in your code, you will get a performance hit from the lower frequency (12.5% in the example quoted above), but that's much less likely than the old case, where people trying to use AVX-512 for memcpy and the like would slow down scalar code.


Dylan16807 2131 days ago

But they fixed the old case by requiring a minimum number of heavy instructions before changing clocks. If you have some instructions there just for the occasional memcpy, it will be a little slow during the memcpy, but it won't downclock and the overall impact will be very small.

Now that the older and bigger case is fixed, this case remains the last sticking point, because you still can't trust the CPU to do the right thing when there are a small number of heavy instructions. Even if they cut the halting time to 0, it's still bad for a single instruction to cause a prolonged downclock.
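The difference between the two trigger policies can be illustrated with a toy model. This is only a sketch under loudly stated assumptions: the hold time, threshold, and window values below are hypothetical and not taken from any Intel documentation, and the function name is made up. It just counts how much time is spent downclocked when a downclock fires immediately versus only after several heavy instructions land close together.

```python
from collections import deque


def low_freq_fraction(heavy_times_us, total_us, hold_us, threshold, window_us):
    """Fraction of total_us spent downclocked in a toy model.

    A downclock is triggered when at least `threshold` heavy instructions
    have been seen within the last `window_us`, and it lasts `hold_us`
    after the triggering instruction. All parameters are hypothetical.
    """
    recent = deque()   # timestamps of heavy instructions inside the window
    low_until = 0.0    # end of the downclocked interval counted so far
    low_total = 0.0
    for t in sorted(heavy_times_us):
        recent.append(t)
        while recent and recent[0] < t - window_us:
            recent.popleft()
        if len(recent) >= threshold:
            start = max(t, low_until)            # avoid double counting overlap
            end = min(t + hold_us, total_us)     # clip to the observed span
            if end > start:
                low_total += end - start
                low_until = end
    return low_total / total_us


# One heavy instruction every 760 us over 7600 us (spacing from the thread).
times = [i * 760.0 for i in range(10)]

# Hypothetical policies: immediate trigger vs. a usage threshold.
immediate = low_freq_fraction(times, 7600.0, hold_us=700.0, threshold=1, window_us=100.0)
thresholded = low_freq_fraction(times, 7600.0, hold_us=700.0, threshold=4, window_us=100.0)

print(f"immediate trigger:   {immediate:.1%} of time downclocked")
print(f"threshold trigger:   {thresholded:.1%} of time downclocked")
```

With these made-up parameters, sparse heavy instructions keep the immediate-trigger core downclocked most of the time, while the threshold policy never downclocks at all, which is the asymmetry being argued about here.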
