by josmala 2119 days ago
As a Zen 2 owner I'm very disappointed in VPGATHERDD throughput; that's so 2013. On the other hand, I like the LOOP and CALL instruction performance a lot.
2 comments

That gather needs to issue 8 independent loads. It's never going to be fast, and I think there's a strong argument that you don't even want to spend the transistors on all the extra load/store units required. The goal of scatter/gather instructions is that they should be demonstrably faster than assembling the values in scalar code, and beyond that... meh.

If you're doing random access to memory like that, you're probably out of the realm of what is appropriate in vector code and should be looking at other hardware (cf. a GPU's texture units) to manage your memory access.

Don't GPUs optimize gathers within the same cache line to effectively be a single fetch from memory and then shuffle? I would assume that's the purpose of VPGATHERDD: not so much for a vector of addresses like (0x1000, 0x2000, 0x3000, 0x4000), where there's no alternative other than to issue 4 loads, but rather for a vector of nearby addresses like (0x2010, 0x2004, 0x2008, 0x2000), where the CPU can coalesce the fetches into one (like PSHUFD with a memory operand does). Gather instructions are especially good when your addresses are usually in the same cache line, but don't have to be—stuff like mipmapped texture lookups in fragment shaders.
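To make the two address patterns above concrete, here is a minimal sketch in C that counts how many distinct cache lines a vector of lane addresses touches, which is the number of fetches an idealized coalescing gather would need. The 64-byte line size and the names `distinct_cache_lines`, `CACHE_LINE_BYTES` and `MAX_LANES` are assumptions for illustration, not anything a particular CPU or GPU is documented to do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u  /* assumption: 64-byte cache lines */
#define MAX_LANES 16u         /* hypothetical cap: 16 dword lanes */

/* Count how many distinct cache lines a vector of lane addresses touches.
   An idealized coalescing gather would need this many fetches; a gather
   that issues one load per lane needs `lanes` of them regardless. */
static size_t distinct_cache_lines(const uint64_t *addrs, size_t lanes)
{
    uint64_t seen[MAX_LANES];
    size_t count = 0;

    assert(lanes <= MAX_LANES);
    for (size_t i = 0; i < lanes; i++) {
        uint64_t line = addrs[i] / CACHE_LINE_BYTES;
        size_t j = 0;
        /* Linear search is fine for at most 16 lanes. */
        while (j < count && seen[j] != line)
            j++;
        if (j == count)
            seen[count++] = line;
    }
    return count;
}
```

Under that 64-byte assumption, the spread-out vector (0x1000, 0x2000, 0x3000, 0x4000) touches 4 lines, while the nearby vector (0x2010, 0x2004, 0x2008, 0x2000) touches only 1.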
I don't think that's the case on most CPUs; VPGATHERDD on Skylake, Ice Lake, etc. always issues the same 4/8/16 port 2/3 uops regardless of what the addresses are.
It's theoretically possible to run gather as fast as one cache line per cycle instead of one SIMD lane per cycle. I don't think anyone has thrown that much permute hardware at the problem, though. It's only profitable if you believe that scatter and gather do have cache locality even when they don't have regularity.
I believe the Xeon Phi series implemented gather this way.
Isn't AVX512 basically cacheline-instructions?
That's the way normal SIMD loads work, yeah.

But the scatter/gather instructions do random-access memory operations. You have one SIMD register with 8 (or whatever the width is) indexes to be applied to a base address in a scalar register, and the hardware then goes and does 8 separate memory operations on your behalf, packing the results into a SIMD register at the end.

That has to hit the cache 8 times in the general case. It's extremely expensive as a single instruction, though faster than running scalar code to do the same thing.
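As a rough illustration of what that means, here is a scalar model in C of an 8-lane dword gather. It is only a sketch: the function name `gather_i32` and the `len` parameter are hypothetical, the real instruction's mask and scale operands are left out, and the bounds check exists only to keep the example safe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 8  /* 8 dword lanes, as in a 256-bit gather */

/* Scalar model of a gather: out[i] = base[idx[i]] for each lane.
   Every iteration is an independent load that may hit a different
   cache line. The bounds assert is for this sketch only; the
   hardware instruction performs no such check. */
static void gather_i32(const int32_t *base, size_t len,
                       const int32_t idx[LANES], int32_t out[LANES])
{
    for (int i = 0; i < LANES; i++) {
        assert(idx[i] >= 0 && (size_t)idx[i] < len);
        out[i] = base[idx[i]];
    }
}
```

With a 16-entry table where `table[i] = i * 10` and indexes {15, 0, 7, 3, 3, 12, 1, 8}, the output lanes are {150, 0, 70, 30, 30, 120, 10, 80}: eight loads, in whatever order the indexes dictate.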

I looked at Agner's tables and was curious how Intel fared with it. All numbers are reciprocal throughput, that is, how many cycles per instruction at sustained throughput. Zen 2 has its gather variants mostly at 9 and 6 cycles, with one variant at 16. Broadwell has only 5, 6 and 7 cycles. Skylake has mostly 4 and 2, with one variant at 5.

Now, I was surprised by Agner's figures for Zen 2 LOOP and CALL, which both have a reciprocal throughput of 2, the same as doing it with just normal jump instructions.

Skylake, on the other hand, has 5 or 6 for LOOP, and two CALL variants at 3 and one variant at 2.

VPGATHER has always been pretty trashy on AMD. Do they microcode it?

Edit: The answer from LLVM's source code is yeah, gather loads are microcoded on Zen.