HN Mirror


by pmh91 1112 days ago

That's a good question.

For the pure-LZ77 Iguana variant (no rANS encoding), most of the decoding time is spent moving memory around rather than decoding the match+length tuples from the input stream. That suggests the performance difference wouldn't be that great if we were limited to AVX2. That said, AVX-512 has a bunch of instructions that are super helpful for parsing our base254 integers quickly, so if I had to take a wild guess I'd say falling back to AVX2 would cost an additional 15%.

One sacrifice we made in the design is that the minimum match offset distance is always 32 bytes, which means we can always perform a literal or match copy by starting with a ymm register load + store. This hurts the compression ratio a bit but it helps performance immensely, and for that reason alone I suspect we'd still come out ahead of lz4 even without AVX-512.
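To illustrate why the 32-byte minimum distance matters, here is a minimal Go sketch, not Iguana's actual code. The function name `copyMatch` and the temporary `reg` array standing in for a ymm register are assumptions made for illustration. Because the offset is never smaller than the chunk width, each 32-byte load reads only bytes that were already written, so the copy needs no byte-by-byte fallback even when the match overlaps its own output.

```go
package main

import "fmt"

// chunk is the width of a ymm register in bytes.
const chunk = 32

// copyMatch appends length bytes to dst, copied from offset bytes back.
// Hypothetical helper: it assumes offset >= chunk, which is the design
// constraint described above.
func copyMatch(dst []byte, offset, length int) []byte {
	if offset < chunk || offset > len(dst) {
		panic("offset must be in [32, len(dst)]")
	}
	start := len(dst)
	// Round the copy up to whole chunks; the slack is trimmed at the end.
	n := (length + chunk - 1) / chunk * chunk
	dst = append(dst, make([]byte, n)...)
	for i := 0; i < n; i += chunk {
		src := start + i - offset
		var reg [chunk]byte             // stands in for a ymm register
		copy(reg[:], dst[src:src+chunk]) // "load": ends at or before start+i
		copy(dst[start+i:], reg[:])      // "store": never overlaps the load
	}
	return dst[:start+length]
}

func main() {
	// 40 bytes of history: 0, 1, ..., 39.
	dst := make([]byte, 40)
	for i := range dst {
		dst[i] = byte(i)
	}
	// Overlapping match: 70 bytes starting 32 bytes back.
	dst = copyMatch(dst, 32, 70)
	fmt.Println(len(dst), dst[40], dst[72], dst[109])
}
```

With a smaller minimum distance, a chunk-wide load could read bytes the same copy has not produced yet, which is the case this design rules out.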

1 comment

hmottestad 1112 days ago

I remember reading that AVX-512 hurt the ability of Intel CPUs to turbo and run other tasks in parallel. This was a few years ago and I would hope it’s not the case anymore, especially since AMD has managed to add AVX-512 support too.

Have you done any testing with running multiple decompression tasks in parallel, or just running a single decompression task while at the same time running other tasks like maybe a web server?

Asm2D 1112 days ago

The initial AVX-512 implementation brought a lot of issues with it. The biggest problem was that Intel used 512-bit ALUs from the beginning, and I think it was just too much at that time (initial 14nm node). Even AMD's Zen4 architecture, which came years after Skylake-X, uses 256-bit ALUs for most operations except complex shuffles, which use a dedicated 512-bit unit to make them competitive. From my experience, AMD's Zen4 AVX-512 implementation is a very competitive one. I just wish it had faster gathers.

Our typical workload at Sneller uses most of the computational power of the machine: we typically execute heavy AVX-512 workloads on all available cores, and we measure our processing performance in GB/s per core. This is why we needed faster decompression: before Iguana, almost 50% of the computational power was spent in a zstd decompressor, which is scalar. The rest of the code is written in Go, but it's insignificant compared to how much time we spend executing AVX-512 now.

(I work for Sneller)
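To put the roughly 50% figure in perspective, here is a small Go sketch that applies Amdahl's law. It is a back-of-the-envelope model, not a Sneller measurement: the function name `overallSpeedup` and the example speedup factors are assumptions chosen for illustration. It estimates how much faster the whole workload gets when only the decompression share is accelerated.

```go
package main

import "fmt"

// overallSpeedup applies Amdahl's law: fraction is the share of total time
// spent in the accelerated part, k is how many times faster that part gets.
func overallSpeedup(fraction, k float64) float64 {
	return 1 / ((1 - fraction) + fraction/k)
}

func main() {
	// Assumed decompressor speedups; 0.5 is the "almost 50%" share above.
	for _, k := range []float64{2, 4, 10} {
		fmt.Printf("decompressor %gx faster: whole workload %.2fx faster\n",
			k, overallSpeedup(0.5, k))
	}
}
```

With half the time in decompression, the overall gain approaches but never reaches 2x, no matter how fast the decompressor becomes.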