Hacker News new | ask | show | jobs
by lifthrasiir 1383 days ago
I think you have missed the original context. The very problem proposed by the OP is to estimate the compression level for given compressed stream. For zlib and gzip there are 9 possible levels (excluding the uncompressed level 0 for zlib), so the naive approach is to first decompress the input and compress it again in nine different levels. I said that we can instead do the partial decompression to collect statistics and rule some levels out, so that we only need to compress it much less than 9 times. You still need to do the full decompression to be able to recompress it anyway, but that wasn't the point at all.
1 comments

There is no way to "rule some levels out". The levels only differ in the quality of the lz matches. Slower levels search further potential matches - you cannot see that from a compressed stream.

You can compare the output from each level and that will be your guess, assuming of course that zlib was used in the first place.

> The levels only differ in the quality of the lz matches. Slower levels search further potential matches - you cannot see that from a compressed stream.

If you see a match that would not be visible from given compression level, or conversely if you don't see a match that should have been selected at given compression level, then it is clear that the level couldn't be used to make that compressed stream. I think preflate actually does this, maintaining multiple hash tables over the course.

That said, I see where my explanation was slightly off in hindsight (sorry!). A bitstream only composed of blocks with the static Huffman table can be produced in multiple ways. One such case is of course a very specific input that is more beneficial to not use dynamic Huffman tables (that said, zlib tallying is AFAIK greedy so this should be very rare). Another case is Z_FIXED, which was actually what I had in mind at that writing---I should have said compression parameters instead. But I stand by the claim that many parameter combinations can be excluded from partial or full decompression, and for the practical purpose this is enough to be classified as the "estimation".

> If you see a match that would not be visible from given compression level[...]

Yes, but you'd need to reconstruct the hash table and the link table. For that you'd need to decompress fully. You could compare matches, but you are so far along at this point you might as well just decompress and recompress, comparing the output bit by bit.