One problem is that the CPU frequency drops significantly when AVX-512 is involved (see e.g. https://en.wikichip.org/wiki/intel/xeon_gold/6262v), which usually cancels out the benefit in the Real World (tm).
This was massively exaggerated by journalists when AVX-512 was first announced.
It is true that randomly applied AVX-512 instructions can cause a slight clock speed reduction, but the proper way to use libraries like this is within specific hot loops, where the mild clock speed reduction is more than offset by the huge increase in parallelism.
This doesn't make sense if you're a consumer multitasking while a background process triggers the AVX-512 penalty, but it usually would make sense in a server scenario.
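To put rough numbers on that trade-off, here is a back-of-the-envelope sketch in Python. It assumes a simple Amdahl-style model in which the whole program runs at the reduced clock while AVX-512 is in use. The function name and every number in it are made up for illustration; none of them are measurements from a real CPU.

```python
def net_speedup(vector_fraction, vector_speedup, clock_ratio):
    """Overall speedup from using AVX-512 under a simple Amdahl-style model.

    vector_fraction: share of the baseline runtime spent in the vectorizable loop (0..1)
    vector_speedup:  per-clock speedup of that loop with AVX-512 (hypothetical)
    clock_ratio:     reduced clock divided by normal clock, e.g. 0.9 (hypothetical)

    Assumption: the entire program runs at the reduced clock.
    """
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be between 0 and 1")
    if vector_speedup <= 0 or clock_ratio <= 0:
        raise ValueError("vector_speedup and clock_ratio must be positive")
    # Cycles needed relative to the baseline: scalar part unchanged, vector part shrinks.
    relative_cycles = (1.0 - vector_fraction) + vector_fraction / vector_speedup
    # Each cycle takes 1 / clock_ratio as long, so the speedup scales with clock_ratio.
    return clock_ratio / relative_cycles


if __name__ == "__main__":
    # Hot loop dominates the runtime: the clock penalty is easily paid back.
    print(f"hot loop:  {net_speedup(0.80, 2.0, 0.9):.2f}x")
    # AVX-512 sprinkled over 5% of the runtime: the penalty outweighs the gain.
    print(f"sprinkled: {net_speedup(0.05, 2.0, 0.9):.2f}x")
```

In this model the break-even point for a fully vectorized workload is a clock ratio of 1 / vector_speedup, so the wider the vectorized share, the more downclocking it can absorb.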
The thing I never understood about this is why Intel didn't just add latency to the AVX-512 instructions instead. That seems much easier than downclocking the whole CPU.
I believe they do actually do something like this - until power and voltage delivery change, wide instructions are throttled independently of frequency changes (which on SKX involved a short halt).
Intel has been trying to reduce the penalty for AVX-512 and, failing that, to advertise that there is no penalty. Most things on Ice Lake run fine with 256-bit vectors, but Skylake and earlier really needed 128-bit or narrower if you weren't doing serious vector math.
Those are client CPUs, which behave very differently from server parts when it comes to power management. However, AVX downclocking has mostly gone away with Ice Lake, and hopefully Sapphire Rapids does away with it permanently (except on 512-bit vectors).
Unless someone has data for the latest Intel chips (i.e. Sapphire Rapids) showing the opposite, I'm inclined to think this is a meme from 2016/7 that needs to go the way of the dodo.
It was largely wrong then, too. Cloudflare, who really kicked off a large amount of the fuss, had "Bronze" class Xeon chips that weren't designed or marketed for what they were attempting to use them for. They were only ever intended for small business stuff, not large-scale, high-performance operations. The AVX-512 downclock is way, way higher on Bronze.
Actually, I stand corrected: after double checking, Cloudflare were using Silver. Those are entry-level data centre chips rather than small business chips. Still not the kind of chips you'd buy for high-performance infrastructure, and not intended to be used for such.
Xeon Silver 4116s hit the market at $1,002.00. The Golds were $1,221.00. The performance differences are quite significant. For something that'll be in service for ~3-5 years, $200 is absolutely trivial by way of a per-chip increase. It's firmly in the "false economy" territory to be skimping on your chip costs. It's a bit more understandable in smaller businesses, but you just don't do it when you're operating at scale.
Also remember: at the scales that Cloudflare are purchasing at, they won't be paying RRP. They'll be getting tidy discounts.
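For what it's worth, the arithmetic behind the "false economy" point can be spelled out in a few lines of Python. This only uses the two list prices quoted above and the ~3-5 year service life; the helper name is hypothetical, and real purchase prices at scale would be lower, as noted.

```python
SILVER_LIST_PRICE = 1002.00  # Xeon Silver 4116 list price quoted above (USD)
GOLD_LIST_PRICE = 1221.00    # Xeon Gold list price quoted above (USD)


def premium_per_year(cheaper, pricier, service_years):
    """Extra cost per chip per year of service for choosing the pricier part."""
    if service_years <= 0:
        raise ValueError("service_years must be positive")
    return (pricier - cheaper) / service_years


if __name__ == "__main__":
    premium = GOLD_LIST_PRICE - SILVER_LIST_PRICE
    print(f"list price premium: ${premium:.2f} per chip")
    for years in (3, 5):
        yearly = premium_per_year(SILVER_LIST_PRICE, GOLD_LIST_PRICE, years)
        print(f"over {years} years: ${yearly:.2f} per chip per year")
```

At list price the gap works out to $219 per chip, or somewhere between about $44 and $73 per chip per year over that service life.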
I would love to see an example of reasonable code not seeing any benefit. On first generation SKX, we saw 1.5x speedups vs AVX2, and that was IIRC even without taking much advantage of AVX3-only instructions.
Please stop spreading this fallacy. While downclocking can happen, the benefit is usually still strong and superior to 256-bit AVX. Even 256-bit AVX can induce downclocking. AVX-512, when properly utilized, simply demolishes non-AVX-512 CPUs.
On that one task. The challenge is when the AVX-512 pieces aren't a bottleneck in every single concurrent workload you run. It's fine if the most important thing you're running on them is code optimized for AVX-512. Realistically though, is that the case for the target market of AVX-512-capable CPUs, given that consumer use cases aren't? The predominant workload would be cloud, right? Where you're running heterogeneous workloads, right? You'd have to get real smart by coalescing AVX-512 and non-AVX-512 workloads onto separate machines and disabling it on the machines that don't need it. That is very complicated work, because you'd have to annotate each workload by hand (memcpy is optimized to use AVX-512 when available, so the presence of AVX-512 in the code is not a sufficient signal).
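As a small illustration of the machine-sorting half of that problem, here is a Python sketch that lists the AVX-512 feature flags a Linux host advertises. It assumes the usual `flags : ...` line format of `/proc/cpuinfo` on x86 Linux; the function names are hypothetical. It only reports what the CPU supports, so it says nothing about whether a given workload actually executes AVX-512 instructions, which is the hard part described above.

```python
def avx512_flags(cpuinfo_text):
    """Return the sorted AVX-512 feature flags from /proc/cpuinfo-style text.

    Only the first "flags" line is inspected, on the assumption that all
    cores on the machine report the same feature set.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return sorted(flag for flag in value.split() if flag.startswith("avx512"))
    return []


def host_supports_avx512(cpuinfo_text):
    """True if the foundation flag (avx512f) is present."""
    return "avx512f" in avx512_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            text = handle.read()
    except OSError:
        text = ""  # not Linux, or the file is unreadable
    print("AVX-512 flags:", avx512_flags(text) or "none found")
```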
The more generous interpretation is that Intel fixed that issue a while back although the CPUs with that problem are still in rotation and you have to think about that when compiling your code.