

by psi-squared 3312 days ago

> I suspect the results in this paper could be improved with more modern gather techniques on newer x86-64 processors.

By that, do you mean the AVX2 'gather' type instructions? If not, I'd be interested to know what those techniques are.

As for AVX2 gathers, I had to look this up recently and it sounds like they're about as fast as manually unpacking the vector and performing scalar loads. That is to say, they're decidedly not fast. On the other hand, it sounds like (as of Skylake) they're bottlenecked on accesses to the L1 cache, so they're about as fast as they reasonably could be.

Source: https://stackoverflow.com/questions/21774454/how-are-the-gat...

Not sure about performance on Zen, but I would imagine it's similar?
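
To make the comparison concrete, here is a minimal sketch of the two approaches: eight scalar loads versus a single AVX2 gather. The function names (`gather8_scalar`, `gather8_avx2`, `gather8`) are made up for illustration, and the code assumes GCC or Clang (for `__attribute__((target))` and `__builtin_cpu_supports`). It shows only that the two are functionally equivalent; it measures nothing, so which one is faster on a given CPU is still something you'd have to benchmark.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define HAVE_X86 1
#else
#define HAVE_X86 0
#endif

/* Scalar baseline: eight independent loads, one per index. */
void gather8_scalar(const int32_t *table, const int32_t idx[8], int32_t out[8]) {
    assert(table != NULL);
    for (int i = 0; i < 8; i++) {
        out[i] = table[idx[i]];
    }
}

#if HAVE_X86
/* AVX2 version: one gather instruction fetches all eight elements.
 * The scale of 4 is sizeof(int32_t), so each index is an element index. */
__attribute__((target("avx2")))
void gather8_avx2(const int32_t *table, const int32_t idx[8], int32_t out[8]) {
    __m256i vidx = _mm256_loadu_si256((const __m256i *)idx);
    __m256i v = _mm256_i32gather_epi32((const int *)table, vidx, 4);
    _mm256_storeu_si256((__m256i *)out, v);
}
#endif

/* Hypothetical dispatcher: use the gather if the CPU reports AVX2,
 * otherwise fall back to the scalar loop. */
void gather8(const int32_t *table, const int32_t idx[8], int32_t out[8]) {
#if HAVE_X86
    if (__builtin_cpu_supports("avx2")) {
        gather8_avx2(table, idx, out);
        return;
    }
#endif
    gather8_scalar(table, idx, out);
}
```

Calling `gather8(table, idx, out)` with any eight in-range indices should fill `out` with the same values as the scalar loop.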

1 comment

jbapple 3312 days ago

> do you mean the AVX2 'gather' type instructions?

Yes, and AVX512, with two caveats:

1. This is processor-dependent. The relative latencies of instructions differ from one processor to another.

2. Sometimes a related instruction is faster than the instruction designed specifically for the purpose. For instance, in the glibc on my machine, memcmp and strncmp are NOT implemented using the SSE4.2 string-comparison instructions, but instead use ptest and pcmpeq, respectively, because it is faster to do so. The same could be true of gathers as well.
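
As a rough illustration of the second point, here is a sketch of a byte-equality check built on the plain SSE2 compare (`pcmpeqb`, via `_mm_cmpeq_epi8`) rather than the SSE4.2 string instructions. This is not glibc's code; `bytes_equal` is a hypothetical helper, and it only answers "equal or not" rather than giving memcmp's ordering result.

```c
#include <stddef.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Returns 1 if the first n bytes of a and b are equal, 0 otherwise. */
int bytes_equal(const void *a, const void *b, size_t n) {
    const unsigned char *p = a;
    const unsigned char *q = b;
    size_t i = 0;
#if defined(__SSE2__)
    /* Compare 16 bytes per iteration. */
    for (; i + 16 <= n; i += 16) {
        __m128i x = _mm_loadu_si128((const __m128i *)(p + i));
        __m128i y = _mm_loadu_si128((const __m128i *)(q + i));
        /* pcmpeqb sets a byte to 0xFF where the inputs match;
         * pmovmskb packs the top bit of each byte into a 16-bit mask. */
        if (_mm_movemask_epi8(_mm_cmpeq_epi8(x, y)) != 0xFFFF) {
            return 0;
        }
    }
#endif
    /* Tail (or everything, when SSE2 is unavailable). */
    return memcmp(p + i, q + i, n - i) == 0;
}
```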