AMD Zen has 4 128-bit SIMD units, while Skylake-S (and earlier) Intel chips have 3 256-bit SIMD units, and Skylake-X/SP (hereafter SKX) chips additionally have 1 or 2 512-bit SIMD units (which overlap with the 256-bit units).
Now not all units can run all instructions. E.g., Intel chips run FP instructions on only 2 of those units, but AMD can run FP instructions on all 4, so in that sense they are even.
However, not all FP instructions can run on all units on AMD: FP multiplications run on only 2 units, and FP additions run on two units - so if you are doing all multiplies, both AMD and Intel can do 2 per cycle and since Intel is twice (quadruple on SKX with 2 FMA units) as wide, then Intel is twice as fast. If you do a 1:1 mix of mul and add, however, then AMD and Intel may be tied - but then AMD is further hamstrung by only have 2 128-bit load units, vs Intel's 2 256-bit load units (512-bit on SKX) - so it is entirely possible that many kernels are limited by load/store throughput on AMD.
For things like in-lane shuffles, AMD and Intel (pre-SKX) have the same throughput: AMD has 2 128-bit shuffle units, and Intel 1 256-bit one. For cross-lane shuffles the 128-bit AMD units really struggle since multiple ops are required and Intel wins big.
So AMD's AVX-256 perf being half of Intel's is more or less the worst case, and many cases will see closer performance. If Zen 2 doubles everything to 256-bits everything will change dramatically.
SKX is what Zen competes with, for now. I don't think Zen 2 will arrive for Epyc any time soon, and when it does, maybe Intel will have a new part out. The current timeline for Zen 2 is desktop parts maybe 1Q2019, server parts traditionally lag. We have to look at what's actually in front of us.
No one has announced 256 AVX in Zen 2; while I'm sure it's possible, I'd think AMD would be advertising that at some point if it were happening. On the other hand, they've doubled the chiplet per socket count, which may compensate somewhat for the reduced AVX width per core.
For the record, I'm firmly in AMD's camp here; I appreciate both the underdog aspect and the renewed competition in the x86 space. I own a first-gen threadripper myself. But it's still important to acknowledge where Zen falls short compared to Skylake-X.
I mentioned Zen2 in a single sentence at the end of my reply, in the context of if Zen2 has 256-bit units (I think the CTO has confirmed it will), the comparison wrt to Intel will change. Intel may have another chip out at that point, but it seems very likely that 2x 512-bit FMA units will still be the top end. My comment was also in the context of Zen2 chips in general, not necessarily just EPYC (since at this point we are comparing uarches). It seems relevant enough to mention it since the first of these chips are apparently imminent.
The rest of my comparison was Zen vs SKL and SKX. Zen competes against both of those, SKL in the laptop, desktop and (some) workstation space, and SKX in the server, (some) workstation and HEDT space. As a practical matter, for things like choosing on which hardware to deploy to in the cloud, it still also competes against Broadwell all the way back to Sandy Bridge, since chips of that era still dominate in the data center (and Intel still sells a ton of those chips).
> But it's still important to acknowledge where Zen falls short compared to Skylake-X.
I don't see how you could read my post and come to another conclusion? AVX-256 performance falls somewhere between half (worst case) and approximately equal (best case) to Intel, depending on your load. FMA-heavy, and L1/L2-hit load heavy will be close to worst case, and some integer, suffle/permute heavy or memory bound loads will be closer to best-case.
Yeah, all of this is mostly fair. I dispute the idea that SKL and the laptop/desktop space is significant in a discussion of AVX — I don't think the laptop/desktop space cares much about AVX, and thus SKL (and non-EPYC Zen) is mostly irrelevant (IMO). That isn't a hard fact, though, and your opinion is also reasonable.
I'm not sure it makes too much sense to compare against older Cloud platforms — the same reason older Intel µarchs dominate (deployment of newer hardware takes time and money) also limits the availability of new AMD µarchs. But it is a reasonable point, if the relative prices of the offerings don't reflect the cost of new deployment in the way I imagine they would.
> I don't see how you could read my post and come to another conclusion?
I guess I misread your post! I'm sorry. My initial impression was that it was highly defensive of AMD. But I think I read too much into it. I'm sorry about that.
> I dispute the idea that SKL and the laptop/desktop space is significant in a discussion of AVX
Well I think it is significant. I'd say that overall the laptop/desktop space makes reasonable use of AVX and AVX2, probably more than your average load in the data center.
HPC certainly makes the most use of AVX/AVX2, but on the laptop/desktop you have at least:
- Gaming
- Media encoding and in some cases decoding (this also happens on GPU)
- Rendering and graphics work (this also happens on GPU)
- All sorts of random AVX2 use in compiler generated code and runtime libraries (e.g., AVX2 is used widely in perf sensitive libc routines like memcpy) and even in JIT-generated code
> AVX2 is used widely in perf sensitive libc routines like memcpy
Depends on the libc. I believe FreeBSD's libc avoids AVX to avoid the additional context switching cost for libc-using programs that don't already use AVX.
All Skylake-X chips actually have the second 512b SIMD unit, including the i7s. Intel's initial documentation here was incorrect, InstLatX64 determined that both units are enabled.
AMD Zen has 4 128-bit SIMD units, while Skylake-S (and earlier) Intel chips have 3 256-bit SIMD units, and Skylake-X/SP (hereafter SKX) chips additionally have 1 or 2 512-bit SIMD units (which overlap with the 256-bit units).
Now not all units can run all instructions. E.g., Intel chips run FP instructions on only 2 of those units, but AMD can run FP instructions on all 4, so in that sense they are even.
However, not all FP instructions can run on all units on AMD: FP multiplications run on only 2 units, and FP additions run on two units - so if you are doing all multiplies, both AMD and Intel can do 2 per cycle and since Intel is twice (quadruple on SKX with 2 FMA units) as wide, then Intel is twice as fast. If you do a 1:1 mix of mul and add, however, then AMD and Intel may be tied - but then AMD is further hamstrung by only have 2 128-bit load units, vs Intel's 2 256-bit load units (512-bit on SKX) - so it is entirely possible that many kernels are limited by load/store throughput on AMD.
For things like in-lane shuffles, AMD and Intel (pre-SKX) have the same throughput: AMD has 2 128-bit shuffle units, and Intel 1 256-bit one. For cross-lane shuffles the 128-bit AMD units really struggle since multiple ops are required and Intel wins big.
So AMD's AVX-256 perf being half of Intel's is more or less the worst case, and many cases will see closer performance. If Zen 2 doubles everything to 256-bits everything will change dramatically.