Looks like a very tiny LUT; perhaps it all fits in SIMD registers.
In any case, for anything to go at 12 GB/s, the LUT needs to be tiny and pretty much held in SIMD register(s).
Any indirect memory access would drastically slow things down, by at least an order of magnitude, even if the data is in L1 cache.
That holds even when using something like x86 V[P]GATHER [0] (element-wise vector random-access load). x86 gather is currently only slightly better than scalar code, even if all the data is on the same cache line already present in L1D!
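
To illustrate what "LUT in a SIMD register" can mean, here's a minimal sketch of a 16-entry byte table looked up with PSHUFB (`_mm_shuffle_epi8`, SSSE3). The function name and the nibble-to-hex mapping are made up for the example; I'm not claiming this is what the code under discussion does. It falls back to a plain scalar loop when not compiled with SSSE3 enabled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSSE3__)
#include <tmmintrin.h> /* _mm_shuffle_epi8 (PSHUFB) */
#endif

/* Hypothetical example: map the low nibble of each input byte to its
 * ASCII hex digit using a 16-entry table. */
static void low_nibbles_to_hex(const uint8_t *in, uint8_t *out, size_t n)
{
    static const uint8_t lut[16] = {
        '0', '1', '2', '3', '4', '5', '6', '7',
        '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'
    };
    size_t i = 0;

    assert(in != NULL && out != NULL);

#if defined(__SSSE3__)
    /* The whole table is loaded once and then lives in one XMM register. */
    const __m128i table = _mm_loadu_si128((const __m128i *)lut);
    const __m128i mask  = _mm_set1_epi8(0x0F);

    for (; i + 16 <= n; i += 16) {
        /* Keep only the low 4 bits so every index is in 0..15. */
        __m128i idx = _mm_and_si128(
            _mm_loadu_si128((const __m128i *)(in + i)), mask);
        /* 16 lookups in one instruction, no per-element memory access. */
        _mm_storeu_si128((__m128i *)(out + i),
                         _mm_shuffle_epi8(table, idx));
    }
#endif

    /* Scalar tail, and the full fallback when SSSE3 is not enabled. */
    for (; i < n; i++)
        out[i] = lut[in[i] & 0x0F];
}
```

The point is that the vector path only does streaming loads and stores; the table lookups themselves never touch memory. The scalar loop does one indexed load per byte.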
[0]: https://www.felixcloutier.com/x86/vpgatherdd:vpgatherqd