by nemequ1729
2149 days ago
I hadn't seen that post before, thanks. This doesn't quite apply to SIMDe; the problem that post is talking about is really at a higher level… whether it is faster to do a bunch of shuffles or use some scalar code. Once SIMDe is called you've already made your decision, and at that level the hardware-based shuffles are much faster than scalar code. For example, see the decompression speed benchmarks for LZSSE-SIMDe (<https://github.com/nemequ/LZSSE-SIMDe>) (they're in the README). It sounds like what that post really needs is a fast 16-bit gather operation. AVX2 has some 32-bit gather functions which may be usable (2 gathers + a blend could emulate 16-bit gathers). For NEON, you could probably use one of the `vtbl` functions; they're all 8-bit, but that just means you have separate index entries for high and low bytes… it's a bit more code, but there shouldn't be any runtime overhead.

The goal was to emulate a 4-way PowerPC TLB on x86-64. Four uint32_t values had to be compared to find a match. The data structure was roughly "uint32_t array[512][4][4]", laid out so that the 4 uint32_t values would be adjacent.
It simply didn't perform well. Getting the equality test results out of SIMD was lengthy, awkward, and slow.
That task was so perfect for SIMD, and yet SIMD failed at it. The data was exactly the size of an SSE XMM register. It was aligned. The task was a simple parallel operation.
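
For anyone following along, here is a rough sketch of the kind of lookup being described. It is not the original emulator code. The names `tlb_tags_t` and `tlb_lookup` are made up for illustration, and the layout is simplified to a single row of four tags per set (the original array had more fields per set). The SSE2 path compares all four ways at once with `_mm_cmpeq_epi32` and then uses `_mm_movemask_ps` to pull the result into a scalar register, which is the "getting the results out" step mentioned above. A scalar fallback is included so the sketch still compiles on non-x86 targets.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical, simplified layout: 512 sets, 4 ways per set,
 * with the 4 tags of a set stored adjacently (16 bytes). */
#define TLB_SETS 512
#define TLB_WAYS 4

typedef struct {
    uint32_t tags[TLB_SETS][TLB_WAYS];
} tlb_tags_t;

/* Returns the index (0..3) of the first way whose tag equals `tag`,
 * or -1 if no way in the set matches. */
static int tlb_lookup(const tlb_tags_t *tlb, unsigned set, uint32_t tag)
{
#if defined(__SSE2__)
    /* Unaligned load keeps the sketch safe; _mm_load_si128 could be
     * used instead if the array is known to be 16-byte aligned. */
    __m128i ways = _mm_loadu_si128((const __m128i *)tlb->tags[set]);
    __m128i key  = _mm_set1_epi32((int)tag);
    __m128i eq   = _mm_cmpeq_epi32(ways, key);   /* all-ones per matching lane */

    /* Collapse the 4 lane results into a 4-bit scalar mask (bit i = way i). */
    int mask = _mm_movemask_ps(_mm_castsi128_ps(eq));
    if (mask == 0)
        return -1;

    /* Find the lowest set bit without compiler-specific builtins. */
    int way = 0;
    while ((mask & 1) == 0) {
        mask >>= 1;
        way++;
    }
    return way;
#else
    /* Scalar fallback for targets without SSE2. */
    for (int way = 0; way < TLB_WAYS; way++) {
        if (tlb->tags[set][way] == tag)
            return way;
    }
    return -1;
#endif
}
```

The compare itself is a single instruction; the extra work is all in turning the lane mask into a way index and then branching on it, and whether that beats four scalar compares in practice would need measuring on the actual workload.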