Hacker News new | ask | show | jobs
by jgtrosh 2979 days ago
He claims 100% compression, but that clearly disregards the digit ID which is potentially quite large to store. Any idea as to how to compute stats for actual compression (size(content)/size(index))?
4 comments

Assuming that pi is a normal number [1], its digits are uniformly distributed. Therefore the density of a given binary sequence of length N in pi’s binary expansion is 2^-N.

So on average, the offset for a given binary sequence will be on the order of 2^N (I’ve used some hand-waving here). You need N bits for that, you you’ve gained basically nothing.

[1] https://en.m.wikipedia.org/wiki/Normal_number

A friend and I tried to do this in highschool and realized it wasn't so useful except for a cool factor.
I think reading a bit farther it becomes clear that it's a parody

Edited to fixed terrible spelling

You mean “parody.” This near-homophone is quite confusing here!
Oops. My only excuse, weak as it is. Is that I typed that on my phone. Fixing it now thanks.
Sure it is! It's an interesting problem nonetheless :-)
Size of the index would just be log2(index), right?
Your calculation works backward from already knowing the index.

I assume jgtrosh's question is more of a probability and estimation question. E.g. the string "Hello world" takes up 11 bytes and what's the size of the index into pi? To generalize a little more, to store any 11-byte sequence, what is the estimated size to store its index? To find the exact 256^11 byte sequence in pi, there will be many cases where the sizeof(index) is greater than the data itself.

Yes but index would be ~2^(length of content)
> He claims 100% compression, but that clearly disregards the digit ID which is potentially quite large to store

That reminds me of a compression program I once wrote. It had the following properties:

1. It accepts any non-empty file other than a file consisting of a single 0 byte.

2. It is easily reversible.

3. The output is never larger than the input.

4. The output is smaller than the input for some files.

5. If a file is larger than 1 byte, iterating enough times will always eventually produce a smaller file. What is "enough times" depends on the file content.

(I wrote this partly as a joke, and partly as a counterexample to some proofs I'd seen on Usenet about the limitations of compression programs where the proofs had not been sufficiently precise in specifying the conditions necessary for them to apply).

It is actually quite simple:

1. If the file is empty or a single byte that is 0, report that the input is not acceptable and exit.

2. If the file is N bytes, treat it as if it is an 8N bit unsigned integer, I.

3. If I is 0, the output file is N-1 bytes all with value 0xFF.

4. If I is not 0, subtract 1, and output the result.

(Decompression is left as an exercise for the reader).

Another way to look at this is to imagine that we have a sorted list of all possible files. The primary sort key is file size. The secondary sort key is the numerical value of the file when treated as an 8N bit unsigned integer, where N is the file size.

My compression algorithm is simply to replace each file with the file immediately ahead of it on that file list.

Viewed that way, it is pretty obvious that all my claims are true. It accepts any non-empty file other than the first file on the list (single 0 byte). To reverse it, replace the input with the file immediately after it from the list. The list is ordered by file size, so stepping down in the list never gives a larger file, and sometimes gives a smaller file. The output file is on the list, so you can obviously iterate.

The catch, of course, is that while yes, you can take your 1 TB of arbitrary data (even true random data!) and apply my algorithm iteratively to get down to, say, a 1 byte file, and yes, you can apply the decompression algorithm iteratively to recover your 1 TB from, to do that you have to know how many times to iterate on the decompression end.

The number of iterations required for an uncompressed file of N bytes to compress down to a byte is on the order of 2^(8N), so you will need about N bytes to store the iteration count.

So net result is that it can replace an N byte file with a 1 byte file...but it is going to drop N bytes of metadata on you that you'll need to store somewhere.