Hacker News new | ask | show | jobs
by terrelln 147 days ago
> | 1.1M | 2.0M | 1.1M | 1.1M | 1.1M |

Something is going terribly wrong with `zstd` here, where it is reported to compress a file of 1.1MB to 2MB. Zstd should never grow the file size by more than a very small percent, like any compressor. Am I interpreting it correctly that you're doing something like `zstd -22 --ultra $FILE && wc -c $FILE.zst`?

If you can reproduce this behavior, can you please file an issue with the zstd version you are using, the commands used, and if possible the file producing this result.

1 comments

Okay now this is weird.

I can reproduce it just fine ... but only when compressing all PDFs simultaneously.

To utilize all cores, I ran:

    $ for x in *.pdf; do zstd <"$x" >"$x.zst" --ultra -22 & done; wait
(and similar for the other formats).

I ran this again and it produced the same 2M file from the source 1.1M file. However when I run without paralellization:

    $ for x in *.pdf; do zstd <"$x" >"$x.zst" --ultra -22; done
That one file becomes 1.1M, and the total size of *.zst is 37M (competitive with Brotli, which is impressive given how much faster it is to decompress).

What's going on here? Surely '-22' disables any adaptive compression stuff based on system resource availability and just uses compression level 22?

Yeah, `--adaptive` will enable adaptive compression, but it isn't enabled by default, so shouldn't apply here. But even with `--adaptive`, after compressing each block of 128KB of data, zstd checks that the output size is < 128KB. If it isn't, it emits an uncompressed block that is 128KB + 3B.

So it is very central to zstd that it will never emit a block that is larger than 128KB+3B.

I will try to reproduce, but I suspect that there is something unrelated to zstd going on.

What version of zstd are you using?

'zstd --version' reports: "** Zstandard CLI (64-bit) v1.5.7, by Yann Collet **". This is zstd installed through Homebrew on macOS 26 on an M1 Pro laptop. Also of interest, I was able to reproduce this with a random binary I had in /bin: https://floss.social/@mort/115940378643840495

I was completely unable to reproduce it on my Linux desktop though: https://floss.social/@mort/115940627269799738

I've figured out the issue. Use `wc -c` instead of `du`.

I can repro on my Mac with these steps with either `zstd` or `gzip`:

    $ rm -f ksh.zst
    $ zstd < /bin/ksh > ksh.zst
    $ du -h ksh.zst
    1.2M ksh.zst
    $ wc -c ksh.zst
     1240701 ksh.zst
    $ zstd < /bin/ksh > ksh.zst
    $ du -h ksh.zst
    2.0M ksh.zst
    $ wc -c ksh.zst
     1240701 ksh.zst
    
    $ rm -f ksh.gz
    $ gzip < /bin/ksh > ksh.gz
    $ du -h ksh.gz
    1.2M ksh.gz
    $ wc -c ksh.gz
     1246815 ksh.gz
    $ gzip < /bin/ksh > ksh.gz
    $ du -h ksh.gz
    2.1M ksh.gz
    $ wc -c ksh.gz
     1246815 ksh.gz
When a file is overwritten, the on-disk size is bigger. I don't know why. But you must have ran zstd's benchmark twice, and every other compressor's benchmark once.

I'm a zstd developer, so I have a vested interest in accurate benchmarks, and finding & fixing issues :)

Interesting!

It doesn't seem to be only about overwriting, I can be in a directory without any .zst files and run the command to compress 55 files in parallel and it's still 45M according to 'du -h'. But you're right, 'wc -c' shows 38809999 bytes regardless of whether 'du -h' shows 45M after a parallel compression or 38M after a sequential compression.

My mental model of 'du' was basically that it gives a size accurate to the nearest 4k block, which is usually accurate enough. Seems I have to reconsider. Too bad there's no standard alternative which has the interface of 'du' but with byte-accurate file sizes...

Yeah, it isn't quite that simple. E.g. `/bin/ksh` reports 1.4MB, but it is actually 2.4MB. Initially, I thought it was because the file was sparse, but there are only 493KB of zeros. So something else is going on. Perhaps some filesystem-level blocks are deduped from other files? Or APFS has transparent compression? I'm not sure.

It does still seem odd that APFS is reporting a significantly larger disk-size for these files. I'm not sure why that would ever be the case, unless there is something like deferred cleanup work.

doesn't zstd cap out at compression level 19?
From the man page:

    --ultra: unlocks high compression levels 20+ (maximum 22), using a lot more memory.
Regardless, this reproduces with random other files and with '-9' as the compression level. I made a mastodon post about it here: https://floss.social/@mort/115940378643840495