by paxys 644 days ago
Reading through the post they seem to have been hyper-focused on compression ratios and reducing the payload size/network bandwidth as much as possible, but I don't see a single mention of CPU time or evidence of any actual measurable improvement for the end user. I have been involved with a few such efforts at my own company, and the conclusion always was that the added compression/decompression overhead on both sides resulted in worse performance. Especially considering we are talking about packets at the scale of bytes or a few kilobytes at most.
10 comments

They explicitly mention compression time. It’s actually lower in the new approach.

> the compression time per byte of data is significantly lower for zstandard streaming than zlib, with zlib taking around 100 microseconds per byte and zstandard taking 45 microseconds

I am fairly sure those should have been per kilobyte, not per byte.
Those are atrocious numbers, that's only 23kB/s for the faster variant. It should have been GB/s not kB.
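
For anyone who wants to check the arithmetic, here is a minimal sketch that converts the quoted per-byte costs into throughput. The 100 and 45 microsecond figures come from the article quote; the function name is made up for illustration. Taken literally, the zstandard figure works out to about 22 kB/s, close to the number quoted above, and reading the unit as "per kilobyte" instead would put both compressors in the tens of MB/s.

```python
def throughput_bytes_per_s(us_per_byte: float) -> float:
    """Convert a per-byte cost in microseconds into bytes per second."""
    return 1_000_000 / us_per_byte


QUOTED_COSTS_US = (("zlib", 100.0), ("zstandard", 45.0))

# Taking the article's unit literally: microseconds per byte.
for name, cost in QUOTED_COSTS_US:
    print(f"{name}: {throughput_bytes_per_s(cost) / 1000:.1f} kB/s")

# Same figures if the unit was meant to be microseconds per kilobyte
# (1 kB = 1000 bytes here), which multiplies the throughput by 1000.
for name, cost in QUOTED_COSTS_US:
    print(f"{name}: {throughput_bytes_per_s(cost) * 1000 / 1_000_000:.1f} MB/s")
```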
For what it’s worth, the benchmark on the Zstandard homepage[1] shows none of the setups tested breaking 1GB/s on compression, and only the fastest and sloppiest ones breaking 1GB/s on decompression. If you can live with its API limitations, libdeflate is known[2] to squeeze past 1GB/s decompressing normal Deflate compression levels. In any case, asking for multiple GB/s is probably unfair.

Still, looking at those benchmarks, 10MB/s sounds like the absolute minimum reasonable speed, and they’re reporting nearly three orders of magnitude below that. A modern compressor does not run at mediocre dialup speeds; something in there is absolutely murdering the performance.

And I’m willing to believe it’s just the constant-time overhead. The article mentions “a few hundred bytes” per message payload in a stream of messages, and the actual data of the benchmarks implies 1.6KB uncompressed. Even though they don’t reinitialize the compressor on each message, that is still a very very modest amount of data.

So it might be that general-purpose compressors are simply a bad tool here from a performance standpoint. I’m not aware of a good tool for this kind of application, though.
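
One way to probe the constant-overhead theory is to time a single streaming compressor on payloads of different sizes and compare the cost per byte. The sketch below uses Python's standard `zlib` module with a sync flush after every message. The payload shape and the helper names are invented for illustration, and this is not the setup from the article, so the absolute numbers will not match theirs. If fixed per-message cost dominates, the per-byte figure should come out noticeably higher for the smallest payloads.

```python
import time
import zlib


def make_payload(size: int, seq: int) -> bytes:
    """Build a made-up JSON-like message of roughly `size` bytes."""
    head = b'{"seq": %d, "content": "' % seq
    return head + b"lorem ipsum " * (size // 12) + b'"}'


def us_per_byte(size: int, count: int) -> float:
    """Average microseconds per input byte over `count` messages that
    share one streaming compression context."""
    payloads = [make_payload(size, seq) for seq in range(count)]
    comp = zlib.compressobj()
    start = time.perf_counter()
    for msg in payloads:
        comp.compress(msg)
        # Z_SYNC_FLUSH forces out a complete chunk for each message,
        # as a streaming connection needs before sending it.
        comp.flush(zlib.Z_SYNC_FLUSH)
    elapsed = time.perf_counter() - start
    return elapsed * 1_000_000 / sum(len(p) for p in payloads)


for size in (100, 1_600, 100_000):
    print(f"{size:>7} B payloads: {us_per_byte(size, 200):.3f} us/byte")
```

The repeated filler text compresses far better than real traffic would, so treat this as a way to see the shape of the overhead rather than as a benchmark.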

[1] https://facebook.github.io/zstd/#benchmarks

[2] https://github.com/zlib-ng/zlib-ng/issues/1486

One thing to note is that on a given gateway server there are potentially 100k other compression contexts active, and given each connection is transmitting a trickle of small data in an unpredictable way, from different CPU cores as the processes are scheduled by the Erlang VM, chances are the CPU caches are absolutely being thrashed. I imagine this contributes to some level of fixed overhead here too, especially when you're measuring these timings on a machine serving actual production traffic as opposed to simply running a bunch of small payloads through a single compressor.
It’s possible, I guess, but it wouldn’t be my first thought. It’s too slow for that.

A payload of 1.6KB at 45us/B is 75ms, which is below the typical scheduling quantum of about 100ms. (Can’t say anything about Erlang, let alone Erlang bindings to C libraries, but I wouldn’t expect it to be that much smaller either, precisely because of the switching overhead both direct and indirect.) So a single compression operation shouldn’t be getting preempted enough to affect the results.
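
The per-message estimate can be reproduced with a few lines of integer arithmetic. This is only a sketch: it takes 1.6KB as 1600 bytes and assumes a 3 GHz core, neither of which is stated in the article. Under those assumptions the time comes to roughly 72 ms, in line with the figure above, and the cycle count lands in the hundreds of millions.

```python
US_PER_BYTE = 45              # zstandard figure quoted from the article
PAYLOAD_BYTES = 1_600         # "1.6KB", taking K = 1000 (assumption)
CLOCK_HZ = 3_000_000_000      # assumed 3 GHz core, not from the article


def compress_time_us(payload_bytes: int, us_per_byte: int) -> int:
    """Total compression time in microseconds at a fixed per-byte cost."""
    return payload_bytes * us_per_byte


def cycles_spent(time_us: int, clock_hz: int) -> int:
    """CPU cycles that elapse in `time_us` microseconds at `clock_hz`."""
    return time_us * clock_hz // 1_000_000


elapsed_us = compress_time_us(PAYLOAD_BYTES, US_PER_BYTE)
print(f"time per message: {elapsed_us / 1000:.0f} ms")
print(f"cycles per message: {cycles_spent(elapsed_us, CLOCK_HZ):,}")
```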

Typical RAM bandwidth is tens of GB/s (even consumer-class SSDs[1] are single-digit GB/s) so with tens to hundreds of cores that’s not enough to affect anything, and even taking into account the compressor’s window is measured in megabytes not kilobytes that’s likely not enough (it would be a bad compressor that reread its whole window each time, anyway). And the data we’re compressing is not only minuscule, it has just been generated and is virtually guaranteed to be cached.

Honestly, I almost want to say that the benchmark is measuring the wrong thing somehow, except they’re reporting a 2× speedup switching from one compressor to another. So it can’t be the JSON encoding overhead or whatnot, and, unless one of the Erlang bindings is somehow drastically stupider than the other, it shouldn’t be the FFI overhead, and even those are a huge stretch. The Flying Spaghetti Monster be merciful, I cannot see anything here that we could be spending over a hundred million cycles on.

At this point I’m hoping somebody just mixed up the units, because this is really unsettling.

[1] https://lemire.me/en/talk/perfsummit2020/

First, let's establish a cheery mood: Happy Friday!!!!

Second, I noticed we're extrapolating from a tossed-out measurement in "microseconds per byte" here, of extremely small payloads, probably including the fixed-cost overhead of doing anything at all.

All leading up to: Is "atrocious" the right word choice here? :)

More directly: do you really think Discord rolled out a compression algorithm that does 23 KB/s for payloads in the megabytes?

Even more directly, avoiding being passive and just adopting your tone: this is atrocious analysis that glibly chooses to create obviously wrong numbers, then criticizes them as if they are real.

> More directly: do you really think Discord rolled out a compression algorithm that does 23 KB/s for payloads in the megabytes?

yes, actually

that's "yngmi" bait; you're suggesting it takes 2 minutes per intro payload. (2 MB / 20 KB/s ≈ 100 seconds = 1m40s)
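
The same back-of-the-envelope figure as a short sketch, using the rounded 2 MB and 20 KB/s values from the comment above (decimal units assumed, function name made up):

```python
def seconds_to_compress(payload_bytes: int, bytes_per_second: int) -> float:
    """Time to push a payload through a compressor at a fixed throughput."""
    return payload_bytes / bytes_per_second


secs = seconds_to_compress(2_000_000, 20_000)
minutes, rem = divmod(int(secs), 60)
print(f"{secs:.0f} s = {minutes}m{rem:02d}s")
```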
I think one thing this blog post did not mention was the historical context of moving from uncompressed to compressed traffic (using zlib), something I worked on in 2017. IIRC, the bandwidth savings were massive (75%). It did use more server side CPU, and negligible client side CPU, so we went for it anyways as bandwidth is a very precious thing to optimize for especially with cloud bandwidth costs.

Either way the incremental improvements here are great - and it's important to consider optimization both from transport level (compression, encoding) and also from a protocol level (the messages actually sent over the wire.)

Also, one thing not mentioned is that client-side decompression on desktop moved from a JS implementation of zlib (pako) to a native implementation that's exposed to the client via napi bindings.

Can't remember the last time I had to worry about bandwidth for the servers. It only came up when talking about iPhones at the start because everyone was on really slow mobile networks. Our company is usually very cost sensitive but all our server hotels so far have had unmetered bandwidth connected to a 100 Mbps interface. We have had zero complaints during the last 20 years even though we fill that one now and then.

But we don't use any of the usual cloud offerings, only smaller local companies.

> zstandard streaming significantly outperforms zlib both in time to compress and compression ratio.

Time to compress is a measure of how long the CPU spends compressing. So this is in the blog post.

I think the person is concerned with client-side compute, not just server-side compute. The article does not mention whether zstd has additional decompression overhead compared to zlib.

Client-side compute may sound like a contrived issue, but Discord runs on a wide variety of devices. Many of these devices are not necessarily the latest flagship smartphones, or a computer with a recent CPU.

I am going to guess that zstd decompression is roughly as expensive as zlib, since (de)compression time was a motivating factor in the development of zstd. Also the reason to prefer zstd over xz, despite the latter providing better compression efficiency.

zstd has faster decompression

though I always thought lz4 to be the sweet spot for anything requiring speed, somewhat less compression ratio in exchange for very fast compression and decompression

> I don't see a single mention of CPU time

> Looking once again at MESSAGE_CREATE, the compression time per byte of data is significantly lower for zstandard streaming than zlib, with zlib taking around 100 microseconds per byte and zstandard taking 45 microseconds.

That's compression time, the parent is talking about the end user so we want decompression time instead.
Zstd is generally known for markedly faster decompression (as well as compression) than zlib.

https://gregoryszorc.com/blog/2017/03/07/better-compression-...

I imagine it correlates with the end-user device specs, no?
Do the packets transmit through Discord's servers? Reducing their bill may be more important to them than user performance.
They're going from 2+MB (for some reason) to 300KB - even if decompression is "slow," that's going to be a win for their bandwidth costs and for perceived speed for _most_ users.

I was surprised to see little server-side CPU benchmarking too, though. While I'd expect overall client timing for (transfer + decompress) to be improved dramatically unless the user was on a ridiculously fast network connection, I can't imagine server load not being affected in a meaningful way.

There already was compression before, through zlib. The finding, as shown in the post, was that Zstandard was also a lot more efficient than zlib from a CPU time standpoint.
The 2MB case is pathological - an account on MANY servers with no local cache state (the READY payload works to only send data that's changed between when you've reconnected by having the client send hashes of data it knows).
Bandwidth costs for text messages, maybe, but how much data is that really compared to images, audio, video or even just the app's JS bundle?
Presumably that's all CDNed and therefore a lot cheaper to serve.
Probably not for live shared audio/video
The bandwidth probably doesn't really matter, but a 2MB must-have blob vs a 300kB must-have blob at the start of a connection is a big difference.

The start of a TCP connection is limited by round trip times more than bandwidth. Especially for mobile, optimizing to reduce the number of round trips required is pretty handy.
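
To put rough numbers on the round-trip point, here is an idealized slow-start sketch. It assumes an initial congestion window of 10 segments, a 1460-byte MSS, a window that doubles every round trip, no packet loss, and no receive-window limit. It also ignores the TCP and TLS handshakes. Real connections will behave differently, so this only illustrates the shape of the difference.

```python
def slow_start_round_trips(payload_bytes: int,
                           mss: int = 1460,
                           init_cwnd: int = 10) -> int:
    """Round trips needed to deliver `payload_bytes` under idealized
    slow start: the window starts at `init_cwnd` segments and doubles
    every round trip."""
    cwnd = init_cwnd
    sent = 0
    rtts = 0
    while sent < payload_bytes:
        sent += cwnd * mss   # one full window goes out per round trip
        cwnd *= 2            # the window doubles for the next round trip
        rtts += 1
    return rtts


for label, size in (("300kB", 300_000), ("2MB", 2_000_000)):
    print(f"{label}: {slow_start_round_trips(size)} round trips")
```

Under these assumptions the smaller blob needs 5 round trips and the larger one 8, so on a high-latency mobile link the compressed payload saves a few hundred milliseconds before bandwidth even enters the picture.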

It's almost certainly about hosting costs, not user facing value.
Some of those payloads are much larger than a few kilobytes (READY, MESSAGE_CREATE etc.) There is a section and data on "time to compress". No time to decompress though.
They might have been getting murdered by egress fees, in which case they'd be willing to make that sacrifice.
Performance is probably the wrong lens. Mobile data is often expensive in terms of money, whereas compression is cheap in terms of CPU time. More compression is almost always the right answer for users of mobile apps.