by KuiN 1383 days ago
Yeah there are no major microarchitectural changes that we're aware of for these chips. Bigger L2$, support for AVX-512 (kinda ... double pumping 256bit units), possibly wider front-end but not a huge amount more; it's primarily the new process making the difference.

The support for AVX-512 is not "kinda". AVX-512 is supported at the Ice Lake level. The most widespread Intel CPU with AVX-512 support is Tiger Lake, and its support is no better than that of Zen 4: it also provides one 512-bit pipeline for FMA or multiplication and two 512-bit pipelines for simple operations, the same as Zen 4.

What happens is that Zen 4 has the same execution units as Zen 3, so any program which can keep all the execution units busy is accelerated on Zen 4 only by the greater clock frequency.

However Zen 4 has a new frontend for instruction fetching and decoding and for branch prediction. Many programs will be executed more efficiently than on Zen 3, with a better utilization of the execution units, leading to the claimed IPC improvement of 13% on average.

Additionally, rewriting a program to use AVX-512 can also improve the utilization of the execution units, leading to a speed-up greater than the clock frequency ratio.

Support for a certain ISA does not imply anything about the speed of the CPU, even if CPU vendors sometimes change both the ISA and the microarchitecture in the same generation, resulting in greater throughput.

In this case AMD has postponed the improvement of the execution units to Zen 5. Even if the support for AVX-512 does not improve the maximum possible throughput, it improves the average throughput over many programs. The same is true for most of the Intel CPUs that support AVX-512 (all except the top models of server or workstation CPUs), because they have one of the 512-bit FMA units disabled, which results in the same maximum throughput as on Zen 4 or on the older CPUs since Haswell.
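
The "same maximum throughput" point is just arithmetic, so here is a rough sketch of it in Python. It counts peak FP32 operations per cycle per core from the number and width of the FMA pipelines, taking the pipeline counts from the description in this comment. The function name and the labels are made up for illustration, and the count ignores clock frequency, memory bandwidth and everything else that decides real performance.

```python
def peak_fp32_flops_per_cycle(fma_units: int, unit_width_bits: int) -> int:
    """Peak FP32 operations per cycle for a core with the given FMA pipelines."""
    lanes = unit_width_bits // 32  # number of 32-bit elements per vector
    return fma_units * lanes * 2   # one FMA = one multiply + one add per lane


# Pipeline configurations as described in the comment above (labels are illustrative).
configs = {
    "two 256-bit FMA pipes (AVX2 since Haswell)": (2, 256),
    "one 512-bit FMA pipe (most AVX-512 Intel CPUs)": (1, 512),
    "two 512-bit FMA pipes (top server/workstation models)": (2, 512),
}

peaks = {name: peak_fp32_flops_per_cycle(*cfg) for name, cfg in configs.items()}

for name, flops in peaks.items():
    print(f"{name}: {flops} FP32 flops/cycle/core")
```

The first two configurations come out equal, which is why 512-bit instructions on such cores cannot raise the peak. They can only help the core get closer to it.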

You are correct in saying that ISA doesn't matter. The only difference is that it is easier to build an 8-wide parallel instruction pre-fetcher on ARM or another fixed-length instruction architecture than on x86_64, and decoding more instructions can be better for re-ordering and register renaming.
Didn't AVX-512 turn out to be a flop? e.g. https://en.wikipedia.org/wiki/AVX-512#Performance -

"On some processors AVX-512 instructions cause a frequency throttling even greater than its predecessors, causing a penalty for mixed workloads. The additional downclocking is triggered by the 512-bit width of vectors and depend on the nature of instructions being executed, and using the 128 or 256-bit part of AVX-512 (AVX-512VL) does not trigger it. As a result, gcc and clang default to prefer using the 256-bit vectors. ()"

() - https://stackoverflow.com/questions/56852812/simd-instructio...

The AVX-512 instruction set has never been a flop. It is a much better instruction set than AVX.

Most AVX-512 instructions have 3 variants, with 512-bit registers, with 256-bit registers or with 128-bit registers.

When using the 256-bit or the 128-bit AVX-512 instructions, there has never been any disadvantage versus using AVX.

The only problems have been when using the 512-bit AVX-512 instructions, especially on the CPUs derived from Skylake Server, due to the way Intel has implemented the clock frequency control.

Using the 512-bit AVX-512 instructions requires more power than when using the 256-bit AVX-512 instructions, the same as when using e.g. 4 cores instead of 2 cores. In both cases, when doubling the operation width or when doubling the number of active cores, the clock frequency is reduced.

When a program has a large proportion of 512-bit instructions, then the throughput is higher despite the lower clock frequency.

On the other hand, when a program has only a few 512-bit instructions, the execution will be slowed down for almost a second after 512-bit instructions are no longer used, until the CPU decides to power down the upper half of the 512-bit units.
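
To make the trade-off concrete, here is a toy model in Python. Everything in it is an assumption for illustration: the function names, the idea that 512-bit code needs half the cycles of the 256-bit version (`width_gain=2.0`), the pessimistic simplification that the reduced clock applies to the whole program, and the 0.85 clock ratio, which is a made-up number and not a measurement of any CPU.

```python
def speedup_with_512bit(vector_fraction: float, clock_ratio: float,
                        width_gain: float = 2.0) -> float:
    """Speed-up of a 512-bit build over a 256-bit build in a toy model.

    vector_fraction: share of the 256-bit build's cycles spent in code
                     that can be widened to 512 bits.
    clock_ratio:     reduced clock / normal clock (hypothetical value),
                     assumed here to apply to the whole program.
    width_gain:      cycle reduction factor for the widened code (assumed 2x).
    """
    # Cycles relative to the 256-bit build: scalar part unchanged, vector part shrinks.
    relative_cycles = (1.0 - vector_fraction) + vector_fraction / width_gain
    # Fewer cycles, but each cycle takes longer at the reduced clock.
    return clock_ratio / relative_cycles


def break_even_fraction(clock_ratio: float, width_gain: float = 2.0) -> float:
    """Vector fraction at which the 512-bit build is exactly as fast as the 256-bit one."""
    return (1.0 - clock_ratio) / (1.0 - 1.0 / width_gain)


if __name__ == "__main__":
    ratio = 0.85  # hypothetical clock reduction, for illustration only
    for fraction in (0.05, 0.30, 0.60, 0.90):
        print(f"vector fraction {fraction:.2f}: "
              f"speed-up {speedup_with_512bit(fraction, ratio):.2f}x")
    print(f"break-even vector fraction: {break_even_fraction(ratio):.2f}")
```

Under these assumptions a program with only a few 512-bit instructions ends up slower than its 256-bit version, while a program dominated by them comes out ahead despite the lower clock.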

This whole problem arises because the Intel CPU tries to be too smart and decides automatically when to power down the unused units.

In the similar case of using more cores, there is no problem, because when a core is no longer used, the program executes a halt or an MWAIT instruction, which immediately powers down the core and restores the higher clock frequency.

If Intel had provided an instruction like "end of 512-bit instructions" to power down the upper halves of the execution units immediately, there would have been no problems with the slowdown caused by sporadically using a few 512-bit instructions, exactly as there is no problem when launching some extra execution threads, because the clock frequency is restored when the extra threads finish or are suspended.

Because Zen 4 has the same execution units as Zen 3, using AVX-512 on Zen 4 will not cause any kind of slowdown that would not have also happened when using AVX on Zen 3.

Thank you Adrian (B) for explaining this thoroughly! I'll use your comment as a future reference!
AVX-512 was a bit of a flop initially, because of how Intel implemented it. AMD's solution doesn't provide quite as much peak throughput for highly-optimized code, but is a better way of providing the flexibility benefits of AVX-512 to the masses without the severe downclocking. There may still be plenty of situations where it would make sense to use 256-bit vectors with AVX-512 instructions, but on Zen 4 there won't be a strong reason to avoid 512-bit vectors where they are useful.
It was a flop because it was intended for a process node that Intel was delayed on for years. It had massive problems to the point of not really making sense when backported to older nodes.
I don't think it's accurate to say AVX-512 was backported. The original Skylake consumer CPUs released as the second generation of products on Intel's 14nm already had space reserved in the CPU core floorplan for the AVX-512 register file. That space didn't get used until the Skylake server CPUs shipped, still on 14nm several years later. AVX-512 support didn't arrive in the consumer desktop product line until Rocket Lake, which was backported to 14nm but was not remotely the beginning of the AVX-512 story.