They don’t seem to provide a detailed comparison showing how each compression scheme fared at every task, but they do list (some of) their criteria and say they found Brotli the best of the bunch. I can’t tell if that’s a sensible conclusion or not, though. Maybe Brotli did better on code size or memory use?
Hey, they did all the work and more, trust them!!!
> Experts in the PDF Association’s PDF TWG undertook theoretical and experimental analysis of these schemes, reviewing decompression speed, compression speed, compression ratio achieved, memory usage, code size, standardisation, IP, interoperability, prototyping, sample file creation, and other due diligence tasks.
I love when I perform all the due diligence tasks. You just can't counter that. Yes but, they did all the due diligence tasks. They considered all the factors. Every one. Think you have one they didn't consider? Nope.
EDIT: Something weird is going on here. When compressing with zstd in parallel, it produces the garbage results seen here, but when compressing on a single core, it produces results competitive with Brotli (37M). See: https://news.ycombinator.com/item?id=46723158
I just took all PDFs I had in my downloads folder (55, totaling 47M). These are invoices, data sheets, employment contracts, schematics, research reports, a bunch of random stuff really.
I compressed them all with 'zstd --ultra -22', 'brotli -9', 'xz -9' and 'gzip -9'. Here are the results:
Zstd is surprisingly bad on this data set. I'm guessing it struggles with the already-compressed image data in some of these PDFs.
Going by only compression ratio, brotli is clearly better than the rest here and zstd is the worst. You'd have to find some other reason (maybe decompression speed, maybe spec complexity, or maybe you just trust Facebook more than Google) to choose zstd over brotli, going by my results.
I wish I could share the data set for reproducibility, but I obviously can't just share every PDF I happened to have lying around in my downloads folder :p
Turns out that these numbers are caused by APFS weirdness. I used 'du' to get them, which reports the size on disk, and that size is weirdly bloated for some reason when compressing in parallel. I should've used 'du -A', which reports the apparent size.
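For anyone who wants to check the same thing without depending on 'du -A' (BSD/macOS) or 'du --apparent-size' (GNU), here is a rough sketch using only POSIX tools. The function names are made up for illustration: 'wc -c' counts content bytes (the apparent size), while plain 'du -k' reports allocated space in KiB.

```shell
# Apparent size: number of content bytes, independent of how the
# filesystem allocates blocks. (Hypothetical helper name.)
apparent_bytes() {
    wc -c <"$1" | tr -d ' '
}

# Allocated size in KiB, i.e. what plain 'du' reports.
allocated_kib() {
    du -k "$1" | awk '{print $1}'
}

# Total apparent size of several files, e.g.: total_apparent_bytes *.zst
total_apparent_bytes() {
    cat -- "$@" | wc -c | tr -d ' '
}
```

If 'apparent_bytes' and 'allocated_kib' disagree wildly for the same file, the difference comes from the filesystem rather than from the compressor.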
Here's a table with the correct sizes, reported by 'du -A' (which shows the apparent size):
Something is going terribly wrong with `zstd` here, where it is reported to compress a file of 1.1MB to 2MB. Zstd should never grow the file size by more than a very small percent, like any compressor. Am I interpreting it correctly that you're doing something like `zstd -22 --ultra $FILE && wc -c $FILE.zst`?
If you can reproduce this behavior, can you please file an issue with the zstd version you are using, the commands used, and if possible the file producing this result.
I can reproduce it just fine ... but only when compressing all PDFs simultaneously.
To utilize all cores, I ran:
$ for x in *.pdf; do zstd <"$x" >"$x.zst" --ultra -22 & done; wait
(and similar for the other formats).
I ran this again and it produced the same 2M file from the source 1.1M file. However, when I run without parallelization:
$ for x in *.pdf; do zstd <"$x" >"$x.zst" --ultra -22; done
That one file becomes 1.1M, and the total size of *.zst is 37M (competitive with Brotli, which is impressive given how much faster it is to decompress).
What's going on here? Surely '-22' disables any adaptive compression stuff based on system resource availability and just uses compression level 22?
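One way to narrow this down is to check whether the parallel and sequential runs differ in content or only in reported size. A minimal sketch, assuming the two sets of outputs were written to separate directories with matching file names ('compare_dirs' is a hypothetical helper, not part of zstd):

```shell
# Compare same-named files in two directories byte for byte.
# Prints the names that differ; returns non-zero if any do.
compare_dirs() {
    dir_a=$1
    dir_b=$2
    status=0
    for f in "$dir_a"/*; do
        [ -e "$f" ] || continue          # skip the unexpanded glob if dir is empty
        name=${f##*/}
        if ! cmp -s "$f" "$dir_b/$name"; then
            echo "differs: $name"
            status=1
        fi
    done
    return "$status"
}
```

Usage would be something like 'compare_dirs zst zst2'. If nothing differs, the compressor produced the same bytes both times and the size discrepancy lies in how the sizes were measured.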
Yeah, `--adaptive` will enable adaptive compression, but it isn't enabled by default, so shouldn't apply here. But even with `--adaptive`, after compressing each block of 128KB of data, zstd checks that the output size is < 128KB. If it isn't, it emits an uncompressed block that is 128KB + 3B.
So it is very central to zstd that it will never emit a block that is larger than 128KB+3B.
I will try to reproduce, but I suspect that there is something unrelated to zstd going on.
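As a rough sanity check on that, the raw-block rule gives an upper bound on output size. The sketch below assumes 128 KiB blocks with a 3-byte header each (as described above) and a single frame with at most 22 bytes of framing (4-byte magic number, frame header of up to 14 bytes, optional 4-byte checksum). The framing figures are my reading of the zstd frame format, not something stated in this thread, and the function name is made up.

```shell
# Upper bound on compressed size if every block were stored raw.
# Assumes: one frame, 128 KiB max block size, 3-byte block headers,
# and at most 22 bytes of frame overhead (assumption, see lead-in).
zstd_raw_upper_bound() {
    n=$1
    block=131072
    blocks=$(( (n + block - 1) / block ))
    if [ "$blocks" -eq 0 ]; then
        blocks=1                      # an empty input still needs one block
    fi
    echo $(( n + 3 * blocks + 22 ))
}
```

For a roughly 1.1 MiB input (9 blocks) that comes to 49 bytes of overhead, so a 2M output can't be explained by the raw-block fallback.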
'zstd --version' reports: "** Zstandard CLI (64-bit) v1.5.7, by Yann Collet **". This is zstd installed through Homebrew on macOS 26 on an M1 Pro laptop. Also of interest, I was able to reproduce this with a random binary I had in /bin: https://floss.social/@mort/115940378643840495
Why not use a more widespread compression algorithm (e.g. gzip), considering that Brotli barely performs better at all? Sounds like a pain for portability.
I'm not sold on the idea of adding compression to PDF at all; I'm not convinced that the space savings are worth breaking compatibility with older readers. Especially when you consider that you can just compress it in transit with e.g. HTTP's 'Content-Encoding' without any special PDF reader support. (You can even use 'Content-Encoding: br' for brotli!)
If you do wanna change PDF backwards-incompatibly, I don't think there's a significant advantage to choosing gzip, to be honest; both brotli and zstd are pretty widely available these days and should be fairly easy to vendor. But yeah, it's a slight advantage I guess. Though I would expect that there are other PDF data sets where brotli has a larger advantage compared to gzip.
But what I really don't get is all the calls to use zstd instead of brotli and treating the choice to use brotli instead of zstd as some form of Google conspiracy. (Is Facebook really better?)
> But what I really don't get is all the calls to use zstd instead of brotli and treating the choice to use brotli instead of zstd as some form of Google conspiracy. (Is Facebook really better?)
I may dislike Google. But my support of JPEG XL and Zstd has nothing to do with the competing tech being Google's at all. I simply think JPEG XL and Zstd are better technology.
I just did some interactive shell loops and globs to compress everything and output CSV which I processed into an ASCII table, so I don't exactly have a pipeline I can modify and re-run the tests with compression speeds added ... but I can run some more interactive shell-glob-and-loop-based analysis to give you decompression speeds:
~/tmp/pdfbench $ hyperfine --warmup 2 \
'for x in zst/*; do zstd -d >/dev/null <"$x"; done' \
'for x in gz/*; do gzip -d >/dev/null <"$x"; done' \
'for x in xz/*; do xz -d >/dev/null <"$x"; done' \
'for x in br/*; do brotli -d >/dev/null <"$x"; done'
Benchmark 1: for x in zst/*; do zstd -d >/dev/null <"$x"; done
Time (mean ± σ): 164.6 ms ± 1.3 ms [User: 83.6 ms, System: 72.4 ms]
Range (min … max): 162.0 ms … 166.9 ms 17 runs
Benchmark 2: for x in gz/*; do gzip -d >/dev/null <"$x"; done
Time (mean ± σ): 143.0 ms ± 1.0 ms [User: 87.6 ms, System: 43.6 ms]
Range (min … max): 141.4 ms … 145.6 ms 20 runs
Benchmark 3: for x in xz/*; do xz -d >/dev/null <"$x"; done
Time (mean ± σ): 981.7 ms ± 1.6 ms [User: 891.5 ms, System: 93.0 ms]
Range (min … max): 978.7 ms … 984.3 ms 10 runs
Benchmark 4: for x in br/*; do brotli -d >/dev/null <"$x"; done
Time (mean ± σ): 254.5 ms ± 2.5 ms [User: 172.9 ms, System: 67.4 ms]
Range (min … max): 252.3 ms … 260.5 ms 11 runs
Summary
for x in gz/*; do gzip -d >/dev/null <"$x"; done ran
1.15 ± 0.01 times faster than for x in zst/*; do zstd -d >/dev/null <"$x"; done
1.78 ± 0.02 times faster than for x in br/*; do brotli -d >/dev/null <"$x"; done
6.87 ± 0.05 times faster than for x in xz/*; do xz -d >/dev/null <"$x"; done
As expected, xz is super slow. Gzip is fastest, zstd is somewhat slower, and brotli is slower again but still much faster than xz.
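The "times faster" figures in the summary are just ratios of the mean times, which is easy to double-check. A tiny sketch ('ratio' is a hypothetical helper; the inputs are the means hyperfine printed above):

```shell
# Ratio of two mean times, rounded to two decimals.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

ratio 164.6 143.0   # zstd vs gzip
ratio 254.5 143.0   # brotli vs gzip
ratio 981.7 143.0   # xz vs gzip
```

Keep in mind that each loop spawns one process per file, so these means include process startup and not just decompression work.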
Zstd should not be slower than gzip to decompress here. Given that it has inflated the files to be bigger than the uncompressed data, it has to do more work to decompress. This seems like a bug, or somehow measuring the wrong thing, and not the expected behavior.
It seems like zstd is somehow compressing really badly when many zstd processes are run in parallel, but works as expected when run sequentially: https://news.ycombinator.com/item?id=46723158
Regardless, this does not make a significant difference. I ran hyperfine again against a 37M folder of .pdf.zst files, and the results are virtually identical for zstd and gzip:
~/tmp/pdfbench $ du -h zst2 gz xz br
37M zst2
38M gz
38M xz
37M br
~/tmp/pdfbench $ hyperfine ...
Benchmark 1: for x in zst2/*; do zstd -d >/dev/null <"$x"; done
Time (mean ± σ): 164.5 ms ± 2.3 ms [User: 83.5 ms, System: 72.3 ms]
Range (min … max): 162.3 ms … 172.3 ms 17 runs
Benchmark 2: for x in gz/*; do gzip -d >/dev/null <"$x"; done
Time (mean ± σ): 142.2 ms ± 0.9 ms [User: 87.4 ms, System: 43.1 ms]
Range (min … max): 140.8 ms … 143.9 ms 20 runs
Benchmark 3: for x in xz/*; do xz -d >/dev/null <"$x"; done
Time (mean ± σ): 993.9 ms ± 9.2 ms [User: 896.7 ms, System: 99.1 ms]
Range (min … max): 981.4 ms … 1007.2 ms 10 runs
Benchmark 4: for x in br/*; do brotli -d >/dev/null <"$x"; done
Time (mean ± σ): 269.1 ms ± 8.8 ms [User: 176.6 ms, System: 75.8 ms]
Range (min … max): 261.8 ms … 287.6 ms 10 runs