Hacker News
by pmh91 1112 days ago
Hi! Sneller CTO here.

Thanks for pointing out the dependency on the AGPL bits -- I'm going to fix that ASAP. (There's just a small utility library we've got to re-license as Apache-2 as well.)

Once we're comfortable committing to the format for the long haul, we're planning on publishing a C implementation as well. There are still a few tweaks we've been evaluating to try to make the compression ratio a tiny bit more competitive.


this is very cool. thank you for releasing it! if you had to guess, how much of the performance depends on avx512 specifically? if this could run reasonably well on avx2, IMO this would be a really great general successor to LZ4
That's a good question.

For the pure-LZ77 Iguana variant (no rANS encoding), most of the decoding time is spent moving memory around rather than decoding the match+length tuples from the input stream, which suggests the performance difference wouldn't be that great if we were only on AVX2. That said, AVX-512 has a bunch of instructions that are super helpful for parsing our base254 integers quickly. If I had to take a wild guess, I'd say running on AVX2 would cost an additional 15%.

One sacrifice we made in the design is that the minimum match offset distance is always 32 bytes, which means we can always perform a literal or match copy by starting with a 32-byte ymm register load + store. This hurts the compression ratio a bit but it helps performance immensely, and for that reason alone I suspect we'd still come out ahead of LZ4 even without AVX-512.
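
To illustrate why the 32-byte minimum offset matters, here is a rough Python sketch (not Iguana's actual decoder, which is hand-written assembly). It compares the classic byte-at-a-time LZ77 match copy with a copy done in 32-byte blocks, standing in for a ymm load + store. The function names and `MIN_OFFSET` are made up for this example. The point is only that when the offset is at least 32, the source of each 32-byte block lies entirely in bytes that are already written, so a block copy produces the same output as the byte-wise reference.

```python
MIN_OFFSET = 32  # assumed from the comment above: matches never start closer than 32 bytes back


def copy_match_bytewise(out: bytearray, offset: int, length: int) -> None:
    """Reference LZ77 match copy: one byte at a time, correct for any offset >= 1."""
    for _ in range(length):
        out.append(out[-offset])


def copy_match_chunked(out: bytearray, offset: int, length: int) -> None:
    """Match copy in 32-byte blocks, mimicking a 256-bit load followed by a store.

    Because offset >= 32, each block's source range ends at or before the
    block's destination start, so the block never reads bytes it is about
    to write.
    """
    if not MIN_OFFSET <= offset <= len(out):
        raise ValueError("offset must be between 32 and the current output length")
    start = len(out)
    out.extend(bytes(length))  # reserve space for the match
    for done in range(0, length, MIN_OFFSET):
        n = min(MIN_OFFSET, length - done)  # clamp the final partial block
        src = start + done - offset
        out[start + done:start + done + n] = out[src:src + n]


if __name__ == "__main__":
    history = bytearray(range(40))
    a, b = bytearray(history), bytearray(history)
    copy_match_bytewise(a, 32, 100)
    copy_match_chunked(b, 32, 100)
    print(a == b)
```

A real SIMD decoder would presumably skip the clamping and simply store a full 32 bytes past the end of the match, letting the next copy overwrite the excess; the sketch clamps only to keep the Python output exact.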

I remember reading that AVX-512 hurt the ability of Intel CPUs to turbo and run other tasks in parallel. This was a few years ago and I would hope it’s not the case anymore, especially since AMD has managed to add AVX-512 support too.

Have you done any testing with running multiple decompression tasks in parallel, or just running a single decompression task while at the same time running other tasks like maybe a web server?

The initial AVX-512 implementation brought a lot of issues with it. The biggest problem was that Intel used 512-bit ALUs from the beginning, and I think it was just too much at that time (initial 14nm node). Even AMD's Zen4 architecture, which came years after Skylake-X, uses 256-bit ALUs for most operations except complex shuffles, which use a dedicated 512-bit unit to make them competitive. And from my experience, AMD's Zen4 AVX-512 implementation is a very competitive one. I just wish it had faster gathers.

Our typical workload at Sneller uses most of the computational power of the machine: we typically execute heavy AVX-512 workloads on all available cores and we measure our processing performance in GB/s per core. This is generally why we needed faster decompression, because before Iguana almost 50% of the computational power was spent in a zstd decompressor, which is scalar. The rest of the code is written in Go, but it's insignificant compared to how much time we spend executing AVX-512 now.

(I work for Sneller)