Hacker News
by dragontamer 990 days ago
> It's not totally obvious to me that there is not some shuffle or permute facility that can load 64 bytes at a time from LUTs.

You are talking about vgather and/or vscatter (gather exists in AVX2 and AVX512, scatter only in AVX512), which are well known to be very slow instructions.

Maybe a future CPU will make these instructions high performance. But no modern 2023-era CPU has a high-speed vgather.

Like: vgather more or less does its loads one at a time, so it's slow. You basically lose the parallelism. Even if the vgather instruction describes exactly what you want to do, it's not an effective parallel operation today (and may never be).
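To make the gather case concrete, here is a minimal sketch in C, assuming GCC or Clang on x86. The helper names (`lookup8_gather`, `lookup8_scalar`, `lookup8`) are made up for illustration. It uses the AVX2 intrinsic `_mm256_i32gather_epi32`, which only gathers 32-bit (or 64-bit) elements, so the table here holds `int32_t` entries rather than bytes. The scalar loop computes the same result; which one is faster depends on the CPU, and the sketch makes no claim either way.

```c
#include <immintrin.h>  /* AVX2: _mm256_i32gather_epi32 */
#include <stdint.h>

/* Hypothetical helper: 8 table lookups expressed as one AVX2 gather. */
__attribute__((target("avx2")))
static void lookup8_gather(const int32_t *table, const int32_t *idx, int32_t *out)
{
    __m256i vi = _mm256_loadu_si256((const __m256i *)idx);
    /* scale = 4 because each index counts 4-byte elements */
    __m256i v = _mm256_i32gather_epi32((const int *)table, vi, 4);
    _mm256_storeu_si256((__m256i *)out, v);
}

/* The same 8 lookups, one at a time. */
static void lookup8_scalar(const int32_t *table, const int32_t *idx, int32_t *out)
{
    for (int i = 0; i < 8; i++)
        out[i] = table[idx[i]];
}

/* Use the gather only when the CPU reports AVX2 support. */
static void lookup8(const int32_t *table, const int32_t *idx, int32_t *out)
{
    if (__builtin_cpu_supports("avx2"))
        lookup8_gather(table, idx, out);
    else
        lookup8_scalar(table, idx, out);
}
```

Both paths return identical values; the gather version only changes how the loads are expressed, not how many memory accesses the hardware has to perform.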

--------

Pshufb used as a LUT with a 4-bit index is the exception: it is effectively a high-speed (but very, very small) lookup table. Like, 64 bytes at a time with the AVX512 version, every cycle, fast.

You are limited by the 16-byte lookup size (aka the size of an SSE register), maybe a bit bigger if there are new instructions I dunno about.
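A minimal sketch of the pshufb-as-LUT trick in C, assuming GCC or Clang on x86 with SSSE3. The example task (per-byte popcount) and the function names are chosen for illustration, not taken from the thread. Each byte is split into two 4-bit nibbles, and each nibble indexes a 16-entry table held in one SSE register. The wider AVX2 and AVX512 forms of the instruction apply the same 16-entry lookup independently within each 128-bit lane.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 (pshufb) */

/* Hypothetical helper: per-byte popcount of 16 bytes, using pshufb as a
   16-entry lookup table indexed by a 4-bit nibble. */
__attribute__((target("ssse3")))
static __m128i popcount16_bytes(__m128i v)
{
    /* lut[i] = number of set bits in the 4-bit value i */
    const __m128i lut = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3,
                                      1, 2, 2, 3, 2, 3, 3, 4);
    const __m128i low_mask = _mm_set1_epi8(0x0F);

    __m128i lo = _mm_and_si128(v, low_mask);                    /* low nibbles  */
    __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), low_mask); /* high nibbles */

    /* Each pshufb performs 16 independent table lookups. */
    return _mm_add_epi8(_mm_shuffle_epi8(lut, lo),
                        _mm_shuffle_epi8(lut, hi));
}

/* Apply the lookup to a buffer whose length is a multiple of 16. */
__attribute__((target("ssse3")))
static void popcount_bytes(const uint8_t *in, uint8_t *out, size_t n)
{
    assert(n % 16 == 0);
    for (size_t i = 0; i < n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(in + i));
        _mm_storeu_si128((__m128i *)(out + i), popcount16_bytes(v));
    }
}
```

The table has to fit in 16 bytes, which is why the byte is processed as two nibbles; anything that needs a larger table has to be decomposed this way or fall back to ordinary memory lookups.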