Source code here: https://github.com/lemire/validateutf8-experiments/tree/mast...

Looks like a very tiny LUT, perhaps it's all in SIMD registers.

In any case, for anything to go at 12 GB/s, the LUT needs to be tiny and pretty much in SIMD register(s).
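To make "a LUT in a SIMD register" concrete, here is a minimal scalar model in Python of what a PSHUFB-style byte shuffle (`_mm_shuffle_epi8` in C intrinsics) computes: sixteen lookups into a 16-entry table with no memory access per element. The table below (UTF-8 sequence length keyed by the high nibble) and the names `pshufb`, `LEN_BY_HIGH_NIBBLE` and `classify` are my own illustration, not taken from the linked repo, which may well use different tables.

```python
def pshufb(table: bytes, indices: bytes) -> bytes:
    """Scalar model of a 128-bit PSHUFB: one table lookup per byte lane.

    If the top bit of an index byte is set, the lane becomes 0;
    otherwise the low 4 bits select one of the 16 table bytes.
    """
    assert len(table) == 16 and len(indices) == 16
    return bytes(0 if i & 0x80 else table[i & 0x0F] for i in indices)


# Hypothetical 16-entry table (an assumption for illustration):
# expected UTF-8 sequence length, keyed by the high nibble of a byte.
#   0x0-0x7 -> 1 (ASCII), 0x8-0xB -> 0 (continuation byte),
#   0xC-0xD -> 2, 0xE -> 3, 0xF -> 4
LEN_BY_HIGH_NIBBLE = bytes([1] * 8 + [0] * 4 + [2, 2, 3, 4])


def classify(block: bytes) -> bytes:
    """Classify 16 input bytes at once using only the in-register table."""
    high_nibbles = bytes(b >> 4 for b in block)  # a vector shift in real SIMD
    return pshufb(LEN_BY_HIGH_NIBBLE, high_nibbles)
```

In real SIMD code the whole of `classify` is a shift, a mask and one shuffle instruction, and the table never leaves a register, which is why the table has to fit in 16 bytes (or a few registers' worth).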

Any indirect memory access would slow things down drastically, by at least an order of magnitude, even if the data is in L1 cache [0].

That holds even when using something like x86 V[P]GATHER [0] (an element-wise vector random-access load). x86 gather is currently only slightly better than scalar code, even if all the data is on the same cache line and present in L1D!

[0]: https://www.felixcloutier.com/x86/vpgatherdd:vpgatherqd