by dragontamer 990 days ago
Your 8-bit lookup can never be parallelized, while the add/cmp/cmov is really easy AVX512 that probably auto-inlines and auto-vectorizes.

I dunno, my instinct says that's a ridiculously big benefit to the code, while the lookup table looks pretty bad.
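To make the comparison concrete, here's a rough sketch of the two styles. The thread doesn't say which function is actually being computed, so ASCII uppercasing is used purely as a stand-in, and every name here (`upper_lut`, `init_upper_lut`, `upper_by_lut`, `upper_by_cmp`, `check_styles_agree`) is made up for illustration. Whether the second loop really gets vectorized depends on your compiler and flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in function: ASCII uppercasing. */
static uint8_t upper_lut[256];

/* Fill the 256-entry table once before use. */
static void init_upper_lut(void) {
    for (int i = 0; i < 256; i++) {
        upper_lut[i] = (uint8_t)((i >= 'a' && i <= 'z') ? i - 32 : i);
    }
}

/* Style 1: 8-bit lookup. Every output byte needs its own
   data-dependent load from the table. */
static void upper_by_lut(uint8_t *dst, const uint8_t *src, size_t n) {
    for (size_t i = 0; i < n; i++) {
        dst[i] = upper_lut[src[i]];
    }
}

/* Style 2: add/cmp/select. No table and no data-dependent
   addresses, just arithmetic on each byte. */
static void upper_by_cmp(uint8_t *dst, const uint8_t *src, size_t n) {
    for (size_t i = 0; i < n; i++) {
        uint8_t c = src[i];
        /* One unsigned compare covers the whole range 'a'..'z'. */
        uint8_t is_lower = (uint8_t)(c - 'a') < 26;
        dst[i] = (uint8_t)(c - (is_lower ? 32 : 0));
    }
}

/* Sanity check: both styles agree on every possible byte value. */
static void check_styles_agree(void) {
    uint8_t src[256], a[256], b[256];
    for (int i = 0; i < 256; i++) src[i] = (uint8_t)i;
    init_upper_lut();
    upper_by_lut(a, src, 256);
    upper_by_cmp(b, src, 256);
    for (int i = 0; i < 256; i++) assert(a[i] == b[i]);
}
```

The second loop only does loads and stores at consecutive addresses, which is the shape auto-vectorizers tend to handle; the first one indexes the table with data, which is where the gather question comes in.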

2 comments

> can never be parallelized

I mean, I'm not skilled enough in those ISA extensions to stick my neck out. It's not totally obvious to me that there is not some shuffle or permute facility that can load 64 bytes at a time from LUTs.

> It's not totally obvious to me that there is not some shuffle or permute facility that can load 64 bytes at a time from LUTs.

You are talking about vgather and/or vscatter, which are well known to be very slow AVX2 or AVX512 instructions.

Maybe a future CPU will make these instructions high performance. But no modern 2023-era CPU has a high-speed vgather.

Like: the vgather does its loads one at a time, which is slow. Even if the vgather instruction describes what you want to do, you basically lose the parallelism. It's not an effective parallel operation today (and may never be).

--------

Pshufb as a 4-bit LUT is the exception: it is effectively a high-speed (but very, very small) lookup table. Like 64-bytes-every-cycle fast.

You are limited by the 16-byte lookup size (aka the size of an SSE register), maybe a bit bigger if there are new instructions I dunno about.
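A minimal sketch of the pshufb-as-LUT trick, assuming a nibble-to-hex-digit mapping as the example table (the thread doesn't name one). `hex_table`, `low_nibbles_to_hex` and `check_low_nibbles_to_hex` are made-up names. The intrinsic path is only compiled when SSSE3 is enabled (e.g. `-mssse3`); otherwise a scalar loop that models the same per-byte behavior is used.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSSE3__)
#include <tmmintrin.h> /* _mm_shuffle_epi8, i.e. pshufb */
#endif

/* The whole lookup table has to fit in one 16-byte register. */
static const uint8_t hex_table[16] = {
    '0', '1', '2', '3', '4', '5', '6', '7',
    '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'
};

/* Map each of 16 input bytes to the hex digit of its low nibble. */
static void low_nibbles_to_hex(uint8_t out[16], const uint8_t in[16]) {
#if defined(__SSSE3__)
    __m128i table = _mm_loadu_si128((const __m128i *)hex_table);
    __m128i idx = _mm_loadu_si128((const __m128i *)in);
    /* Keep only the 4 index bits. This also clears bit 7, which
       pshufb would otherwise treat as "write zero to this lane". */
    idx = _mm_and_si128(idx, _mm_set1_epi8(0x0F));
    /* One shuffle performs all 16 table lookups at once. */
    _mm_storeu_si128((__m128i *)out, _mm_shuffle_epi8(table, idx));
#else
    /* Scalar model of the same operation for builds without SSSE3. */
    for (int i = 0; i < 16; i++) {
        out[i] = hex_table[in[i] & 0x0F];
    }
#endif
}

/* Sanity check: the high nibble is ignored, the low one selects. */
static void check_low_nibbles_to_hex(void) {
    uint8_t in[16], out[16];
    for (int i = 0; i < 16; i++) in[i] = (uint8_t)(0xF0 | i);
    low_nibbles_to_hex(out, in);
    for (int i = 0; i < 16; i++) assert(out[i] == hex_table[i]);
}
```

The index is only 4 bits wide here, which is why a 256-entry table can't be done with a single shuffle like this.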

Vectorized gather loads can do just that, but they are currently pretty underwhelming on x86 machines.

Although it can't be easily vectorized, it can be pipelined, which can help both hide the access latency and keep the lookup table hot. The code is not going to be pretty, though.