by Remnant44
246 days ago
I'm just happy that finally, with the popularity of Zen 4 and Zen 5 chips, AVX512 is at around 20% of the running hardware in the Steam hardware survey. It's going to be a long while before it gets to a majority - Intel still isn't shipping its own instruction set in consumer CPUs - but it's going in the right direction.

Compared to the weird, lumpy Lego set of AVX1/2, AVX512 is quite enjoyable to write with, and still has some fun instructions that deliver more than just twice the width.

Personal example: the double-width byte shuffle (_mm512_permutex2var_epi8) takes 128 bytes as input in two registers. I had a critical inner loop that uses a 256-byte lookup table; running an upper and a lower double-shuffle and blending them essentially pops out 64 answers a cycle from the lookup table on Zen 5 (which has two shuffle units), which is pretty incredible, and on its own produced a global 4x speedup for the kernel as a whole.
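
For readers who have not used this instruction, here is a minimal sketch of the upper/lower double-shuffle idea in C. It is one reading of the description above, not the actual kernel: the function names (lut256, lut256_scalar, lut256_selfcheck) and the test data are made up for illustration. The vector path is only compiled when the compiler targets AVX512-VBMI and AVX512-BW (for example with -mavx512vbmi -mavx512bw); otherwise the scalar loop handles everything.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__AVX512VBMI__) && defined(__AVX512BW__)
#include <immintrin.h>
#endif

/* Scalar reference: out[i] = table[idx[i]] for a 256-entry byte table. */
static void lut256_scalar(const uint8_t *table, const uint8_t *idx,
                          uint8_t *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = table[idx[i]];
}

/* Same lookup, 64 indices per iteration when AVX512-VBMI is available. */
static void lut256(const uint8_t *table, const uint8_t *idx,
                   uint8_t *out, size_t n) {
    size_t i = 0;
#if defined(__AVX512VBMI__) && defined(__AVX512BW__)
    /* The whole 256-byte table lives in four 64-byte registers. */
    const __m512i t0 = _mm512_loadu_si512(table);
    const __m512i t1 = _mm512_loadu_si512(table + 64);
    const __m512i t2 = _mm512_loadu_si512(table + 128);
    const __m512i t3 = _mm512_loadu_si512(table + 192);
    for (; i + 64 <= n; i += 64) {
        const __m512i v = _mm512_loadu_si512(idx + i);
        /* Each shuffle uses only the low 7 bits of every index byte:
           bit 6 picks the first or second source, bits 0-5 pick the byte. */
        const __m512i lo = _mm512_permutex2var_epi8(t0, v, t1); /* entries 0..127   */
        const __m512i hi = _mm512_permutex2var_epi8(t2, v, t3); /* entries 128..255 */
        /* Bit 7 of each index decides which half of the table is wanted. */
        const __mmask64 want_hi = _mm512_movepi8_mask(v);
        _mm512_storeu_si512(out + i, _mm512_mask_blend_epi8(want_hi, lo, hi));
    }
#endif
    /* Tail elements, or all elements when the vector path is not compiled. */
    for (; i < n; i++) out[i] = table[idx[i]];
}

/* Compare both versions on arbitrary data (length chosen to leave a tail). */
static void lut256_selfcheck(void) {
    uint8_t table[256], idx[200], got[200], want[200];
    for (int i = 0; i < 256; i++) table[i] = (uint8_t)(i * 7 + 3);
    for (int i = 0; i < 200; i++) idx[i] = (uint8_t)(i * 37 + 11);
    lut256(table, idx, got, 200);
    lut256_scalar(table, idx, want, 200);
    assert(memcmp(got, want, sizeof got) == 0);
}
```

Two shuffles plus one mask extraction and one blend per 64 lookups is where the throughput comes from; how close that gets to 64 results per cycle depends on the core and on how the surrounding loop is scheduled.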
Compared to Huff0[1] (used by Zstd), my AVX512 code is currently ~40% faster at both compression and decompression. This requires using 32 data streams instead of the 4 used by Huff0.
[1] https://github.com/Cyan4973/FiniteStateEntropy