| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by tzs 2978 days ago

> He claims 100% compression, but that clearly disregards the digit ID which is potentially quite large to store

That reminds me of a compression program I once wrote. It had the following properties:

1. It accepts any non-empty file other than a file consisting of a single 0 byte.

2. It is easily reversible.

3. The output is never larger than the input.

4. The output is smaller than the input for some files.

5. If a file is larger than 1 byte, iterating enough times will always eventually produce a smaller file. What is "enough times" depends on the file content.

(I wrote this partly as a joke, and partly as a counterexample to some proofs I'd seen on Usenet about the limitations of compression programs where the proofs had not been sufficiently precise in specifying the conditions necessary for them to apply).

It is actually quite simple:

1. If the file is empty or a single byte that is 0, report that the input is not acceptable and exit.

2. If the file is N bytes, treat it as if it is an 8N bit unsigned integer, I.

3. If I is 0, the output file is N-1 bytes all with value 0xFF.

4. If I is not 0, subtract 1, and output the result.

(Decompression is left as an exercise for the reader).

Another way to look at this is to imagine that we have a sorted list of all possible files. The primary sort key is file size. The secondary sort key is the numerical value of the file when treated as an 8N bit unsigned integer, where N is the file size.

My compression algorithm is simply to replace each file with the file immediately ahead of it on that file list.

Viewed that way, it is pretty obvious that all my claims are true. It accepts any non-empty file other than the first file on the list (single 0 byte). To reverse it, replace the input with the file immediately after it from the list. The list is ordered by file size, so stepping down in the list never gives a larger file, and sometimes gives a smaller file. The output file is on the list, so you can obviously iterate.

The catch, of course, is that while yes, you can take your 1 TB of arbitrary data (even true random data!) and apply my algorithm iteratively to get down to, say, a 1 byte file, and yes, you can apply the decompression algorithm iteratively to recover your 1 TB from, to do that you have to know how many times to iterate on the decompression end.

The number of iterations required for an uncompressed file of N bytes to compress down to a byte is on the order of 2^(8N), so you will need about N bytes to store the iteration count.

So net result is that it can replace an N byte file with a 1 byte file...but it is going to drop N bytes of metadata on you that you'll need to store somewhere.