Hacker News
by adrian_b 742 days ago
For most instructions, both Intel and AMD CPUs with AVX-512 support can execute two 512-bit instructions per clock cycle. There is no difference between Intel and AMD Zen 4 for most 512-bit AVX-512 instructions.

I expect that this will remain true for Zen 5 and the next Intel CPUs.

The only important differences in throughput between Intel and AMD were for the 512-bit load and store instructions from the L1 cache and for the 512-bit fused multiply-add instructions, where Intel had double throughput in its more expensive models of server CPUs.

I interpret AMD's announcement as meaning that Zen 5 now has double the transfer throughput between the 512-bit registers and the L1 cache and also a doubled 512-bit FP multiplier, so it now matches the Intel AVX-512 throughput per clock cycle in all important instructions.
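To put "double throughput" into numbers, here is a rough back-of-the-envelope sketch. It assumes the difference is simply one versus two 512-bit FMA pipes and one versus two 512-bit L1 loads per cycle; the function names and the unit counts are illustrative assumptions, not vendor data.

```python
# Back-of-the-envelope per-core peak numbers for 512-bit SIMD.
# The unit counts used below are illustrative assumptions, not vendor data.

VECTOR_BITS = 512


def fma_flops_per_cycle(fma_units: int, element_bits: int = 32) -> int:
    """Peak floating-point operations per cycle from 512-bit FMA units."""
    lanes = VECTOR_BITS // element_bits  # 16 lanes for FP32, 8 lanes for FP64
    return fma_units * lanes * 2         # an FMA counts as a multiply plus an add


def l1_load_bytes_per_cycle(loads_per_cycle: int) -> int:
    """Peak bytes per cycle moved from the L1 cache into 512-bit registers."""
    return loads_per_cycle * VECTOR_BITS // 8


if __name__ == "__main__":
    for units in (1, 2):
        print(f"{units} x 512-bit FMA: "
              f"{fma_flops_per_cycle(units)} FP32 FLOP/cycle, "
              f"{fma_flops_per_cycle(units, 64)} FP64 FLOP/cycle")
    for loads in (1, 2):
        print(f"{loads} x 512-bit load: {l1_load_bytes_per_cycle(loads)} bytes/cycle")
```

These are peak figures per core and per cycle only; real code rarely sustains them.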


> There is no difference between Intel and AMD Zen 4 for most 512-bit AVX-512 instructions.

Except that Intel hasn't had any AVX-512 in consumer CPUs for years already, so there's really nothing to compare against in this target market.

The comparison made in the AMD announcement and by everyone else is between the Zen 5 cores, which will be used both in their laptop/desktop products and in their Turin server CPUs, and the Intel Emerald Rapids and future Granite Rapids server CPUs.

As you say, Intel has abandoned the use of the full AVX-512 instruction set in their laptop/desktop products and in some of their server products.

At the end of 2025, Intel is expected to introduce laptop/desktop CPUs that will implement a 256-bit subset of the AVX-512 instruction set.

While that will bring many advantages of AVX-512 that are not related to register and instruction widths, it will lose the simplification of the high-performance programs that is possible in 512-bit AVX-512 due to the equality between register size and cache line size, so the consumer Intel CPUs will remain a worse target for the implementation of high-performance algorithms.
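A small sketch of what the equality between register size and cache line size buys, assuming the usual 64-byte x86 cache line. The helper names are hypothetical. An aligned 512-bit access covers exactly one line, while 256-bit code needs two accesses per line and any misaligned access can straddle two lines.

```python
CACHE_LINE_BYTES = 64  # assumed cache line size (the usual value on x86)


def cache_lines_touched(address: int, width_bytes: int) -> int:
    """Number of cache lines covered by an access of width_bytes at address."""
    first = address // CACHE_LINE_BYTES
    last = (address + width_bytes - 1) // CACHE_LINE_BYTES
    return last - first + 1


def accesses_per_line(width_bytes: int) -> int:
    """Aligned vector accesses needed to cover one full cache line."""
    return CACHE_LINE_BYTES // width_bytes


if __name__ == "__main__":
    print("aligned 512-bit load :", cache_lines_touched(0, 64), "line")
    print("512-bit load at +32  :", cache_lines_touched(32, 64), "lines")
    print("256-bit loads per line:", accesses_per_line(32))
    print("512-bit loads per line:", accesses_per_line(64))
```

With 64-byte alignment, one 512-bit register is one cache line, so a kernel can process memory a whole line at a time without tracking partial lines.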

> of the high-performance programs that is possible in 512-bit AVX-512 due to the equality between register size and cache line size, so the consumer Intel CPUs will remain a worse target for the implementation of high-performance algorithms.

Can you elaborate here? I love full-width AVX-512 as much as the next SIMD nerd, but I rarely considered the match between cache line size and vector width one of its particularly useful features. If anything, it was a sign that AVX-512 was probably the end of the road for full-throughput loads and stores at full AVX register width, since memory operations that span two cache lines are likely to be half-throughput at best and a doubling of the cache line size seems unlikely.

I read that comment as "the wider, the sweeter" (which I agree with), but that we're now (as you say) at the end of the road, and thus the sweetest point.

But an increase in cacheline size would be nice if it can get us larger vectors, or otherwise significantly improve memory bandwidth.

It does seem plausible. Apple has gone to a 128-byte cache line.

The difference is that Intel chips that support AVX-512 run $1,300 - $11,000 with MUCH higher total system costs, whereas AMD actually DOES support AVX-512 on all its chips and you can get AVX-512 for dirt cheap. The whole Intel instruction support story feels like garbage here. Weren't they the ones to introduce this whole 512 thing in the first place?

My view is that Intel is trying to do market segmentation. Want AVX512? You need to buy the "pro" lineup of CPUs...

That's being very generous to Intel. It's much more believable that the strategy of mixing performance and efficiency cores in mainstream consumer processors starting with Alder Lake was a relatively last-minute decision, too late to re-work the Atom core design to add AVX512, so they sacrificed it from the performance cores instead.

The Intel consumer processors due out later this year are expected to include the first major update to their efficiency cores since they introduced Alder Lake. Only after that ships without AVX512 will it be plausible to attribute product segmentation as a major cause rather than just engineering factors.

(Intel's AVX10 proposal indicates their long-term plan may be to never support 512-bit SIMD on the efficiency cores, but to eventually make it possible for full 512-bit SIMD to make a return to their consumer processors.)

No question - but these kinds of games lead to poor adoption.

They made a big to-do about how much better than AMD they are with things like AVX512.

Then they play these games for market purposes.

Then they have stupid clock speed pauses, so when you try it your stuff goes slower.

Meanwhile - AMD is putting it on all their chips and it works reasonably there.

So I've just found the whole Intel style here kind of annoying. I really remember them doing bogus comparisons to non-AVX-512 AMD parts (and projecting when their chips would be out). The reality is that if you are writing software that depends on AVX512, you tell clients to buy AMD to run it. Does AVX512 even work on efficiency cores and things like that? It's a mess.

Or Intel's repeated failures to roll out a working sub-14nm process forced them to cripple even the P-core AVX-512 implementation (one fewer ALU port compared to Xeon Gold) and rip out whatever compatibility feature they wanted to put in the E-cores, while Microsoft refused to implement workarounds in Windows (e.g. make AVX-512 opt-in and restrict threads that enabled it to P-cores, or reschedule threads faulting on AVX-512 instructions from E-cores to P-cores).

Fair point - that cycle took FOREVER. I remember when their marketing slides started doing comparisons between their unreleased future products and current (and sometimes about to be replaced) competitor products.

Which doesn't work when your competitor isn't bought in and puts it in every product.

That'd make some sense if AVX-512 was not part of x86-64-v4, but it is.
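For reference, the x86-64-v4 level adds the AVX-512 subsets F, BW, CD, DQ and VL on top of x86-64-v3. A minimal sketch of checking for them on Linux follows, assuming the flag spellings used in /proc/cpuinfo. The helper names are hypothetical, and the check covers only the AVX-512 part of v4, not the older levels.

```python
# AVX-512 subsets that x86-64-v4 adds on top of x86-64-v3,
# spelled the way Linux lists them in /proc/cpuinfo.
V4_AVX512_FLAGS = frozenset(
    {"avx512f", "avx512bw", "avx512cd", "avx512dq", "avx512vl"}
)


def missing_v4_flags(flags_line: str) -> set:
    """Return the x86-64-v4 AVX-512 flags absent from a cpuinfo 'flags' line."""
    _, _, values = flags_line.partition(":")
    # Accept either a full "flags : ..." line or a bare list of flags.
    present = set((values or flags_line).split())
    return set(V4_AVX512_FLAGS) - present


def read_cpuinfo_flags(path: str = "/proc/cpuinfo") -> str:
    """Return the first 'flags' line from cpuinfo (Linux on x86 only)."""
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return line
    return ""


if __name__ == "__main__":
    try:
        missing = missing_v4_flags(read_cpuinfo_flags())
    except OSError:
        print("no /proc/cpuinfo here; this sketch assumes Linux")
    else:
        print("missing for x86-64-v4:", sorted(missing) or "nothing")
```

A binary built for x86-64-v4 may use any of these subsets, so a CPU lacking them cannot run it regardless of how fast its 256-bit units are.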
> The only important differences in throughput between Intel and AMD

Not exactly related, but AMD also has a much better track record when it comes to speculative execution attacks.

I see the discussion of instruction fusion for AVX512 in Intel chips. Can someone explain the clock speed drop?

A clock speed drop must always occur whenever a CPU does more work per clock cycle, e.g. when more CPU cores are active or when wider SIMD instructions are used.

However, the early generations of Intel CPUs that implemented AVX-512 had poor clock management, which was not agile enough to lower the clock frequency (and with it the power consumption) quickly when higher power consumption pushed the temperature too high, in order to protect the CPU. Because of that, and because there are no instructions that programmers could use to announce their intention to use wide SIMD instructions intensively in a sequence of code, the Intel CPUs lowered the clock frequency preemptively, and by a lot, whenever they feared that the upcoming instruction stream might contain 512-bit instructions that could lead to overheating. The clock frequency was restored only after delays not much shorter than a second. When AVX-512 instructions were executed sporadically, that could slow down any application considerably.
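A toy model of that sporadic-use penalty, with made-up numbers: every short 512-bit burst is assumed to hold the clock at a reduced ratio for a fixed recovery window, and bursts are assumed to be evenly spaced. The function name and all parameter values are hypothetical, chosen only to show the shape of the effect.

```python
def effective_slowdown(bursts_per_second: float,
                       recovery_s: float,
                       reduced_ratio: float) -> float:
    """Runtime multiplier for otherwise scalar code when sporadic 512-bit
    bursts each trigger a reduced-frequency window.

    bursts_per_second: how often a short 512-bit burst occurs (evenly spaced)
    recovery_s:        how long the clock stays lowered after a burst
    reduced_ratio:     lowered frequency / normal frequency, 0 < ratio <= 1
    """
    # Fraction of wall time spent at the lowered clock. Windows that would
    # overlap simply merge, so the fraction is capped at 1.
    low_fraction = min(1.0, bursts_per_second * recovery_s)
    # Average clock speed relative to the normal frequency.
    average_speed = (1.0 - low_fraction) + low_fraction * reduced_ratio
    return 1.0 / average_speed


if __name__ == "__main__":
    # Hypothetical values: 0.5 s recovery window, clock lowered to 80%.
    for rate in (0.0, 1.0, 2.0):
        factor = effective_slowdown(rate, 0.5, 0.8)
        print(f"{rate:.0f} bursts/s -> runtime x{factor:.3f}")
```

In this model, a couple of tiny bursts per second are enough to keep the whole program at the lowered clock, even though the 512-bit work itself is negligible.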

The AMD CPUs and the newer Intel CPUs have better clock management, which reacts more quickly, so the lower clock frequency during AVX-512 instruction execution is no longer a problem. A few AVX-512 instructions will not measurably lower the clock frequency, while the lower clock frequency when AVX-512 instructions are executed frequently is compensated by the greater amount of work done per clock cycle.
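The compensation argument can be written as a one-line estimate. This is an idealized sketch that assumes compute-bound code issuing the same number of SIMD instructions per cycle at either width; the function name and the clock ratios are hypothetical.

```python
def wide_vs_narrow_speedup(wide_bits: int,
                           narrow_bits: int,
                           wide_clock_ratio: float) -> float:
    """Throughput of the wide code path relative to the narrow one.

    wide_clock_ratio is (clock while running wide code) / (clock while
    running narrow code). Assumes both paths issue the same number of SIMD
    instructions per cycle, which is an idealization.
    """
    return (wide_bits / narrow_bits) * wide_clock_ratio


if __name__ == "__main__":
    # Hypothetical clock ratios; 0.5 is the break-even point for 512 vs 256 bits.
    for ratio in (1.0, 0.85, 0.5):
        speedup = wide_vs_narrow_speedup(512, 256, ratio)
        print(f"clock ratio {ratio:.2f} -> 512-bit path is x{speedup:.2f}")
```

Under these assumptions the 512-bit path only loses to the 256-bit one if the clock drops below half, which is far more than the frequency reductions discussed here.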

Thank you for the detailed answer :)