by pcwalton 2119 days ago
Don't CPUs optimize gathers within the same cache line to effectively be a single fetch from memory followed by a shuffle? I would assume that's the purpose of VPGATHERDD: not so much for a vector of addresses like (0x1000, 0x2000, 0x3000, 0x4000), where there's no alternative other than to issue 4 loads, but rather for a vector of nearby addresses like (0x2010, 0x2004, 0x2008, 0x2000), where the CPU can coalesce the fetches into one (like PSHUFD with a memory operand does). Gather instructions are especially good when your addresses are usually in the same cache line but don't have to be: stuff like mipmapped texture lookups in fragment shaders.

I don't think that's the case on most CPUs; VPGATHERDD on Skylake, Ice Lake, etc. issues the same 4/8/16 port 2/3 load uops (for xmm/ymm/zmm destinations) regardless of what the addresses are.
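
To make the two cases concrete, here is a rough sketch that counts how many cache lines each address vector from the parent comment touches. It is a toy model, not a measurement: the 64-byte line size is an assumption, and the function names are made up for illustration. The "coalesced" count is what the parent was hoping for; the "per element" count is what the reply says Skylake and Ice Lake actually do.

```python
CACHE_LINE_BYTES = 64  # assumption: 64-byte cache lines


def cache_lines_touched(addresses, line_bytes=CACHE_LINE_BYTES):
    """Return the sorted base addresses of the cache lines the lanes fall in."""
    return sorted({addr - (addr % line_bytes) for addr in addresses})


def loads_if_coalesced(addresses):
    """Hypothetical best case: one load per distinct cache line."""
    return len(cache_lines_touched(addresses))


def loads_per_element(addresses):
    """One load per lane, independent of where the addresses point."""
    return len(addresses)


# The two address vectors from the parent comment (4 dword lanes each).
scattered = [0x1000, 0x2000, 0x3000, 0x4000]
nearby = [0x2010, 0x2004, 0x2008, 0x2000]

for name, addrs in (("scattered", scattered), ("nearby", nearby)):
    print(name,
          "lines:", [hex(a) for a in cache_lines_touched(addrs)],
          "coalesced loads:", loads_if_coalesced(addrs),
          "per-element loads:", loads_per_element(addrs))
```

The nearby vector lands in a single line, so a coalescing implementation could in principle get away with one load plus a shuffle, while the per-element model still issues four.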