The difference is that Pi is apparently normal, up to all extensive existing evidence. The Universe is not normal. Anyway, Physics doesn't preclude Helms Deep.
I think normalness is a stronger property than "possesses all sequences of the alphabet of its digits", since it means that + "all subsequences are equally likely"
He claims 100% compression, but that clearly disregards the digit ID which is potentially quite large to store. Any idea as to how to compute stats for actual compression (size(content)/size(index))?
Assuming that pi is a normal number [1], its digits are uniformly distributed. Therefore the density of a given binary sequence of length N in pi’s binary expansion is 2^-N.
So on average, the offset for a given binary sequence will be on the order of 2^N (I’ve used some hand-waving here). You need N bits for that, you you’ve gained basically nothing.
Your calculation works backward from already knowing the index.
I assume jgtrosh's question is more of a probability and estimation question. E.g. the string "Hello world" takes up 11 bytes and what's the size of the index into pi? To generalize a little more, to store any 11-byte sequence, what is the estimated size to store its index? To find the exact 256^11 byte sequence in pi, there will be many cases where the sizeof(index) is greater than the data itself.
> He claims 100% compression, but that clearly disregards the digit ID which is potentially quite large to store
That reminds me of a compression program I once wrote. It had the following properties:
1. It accepts any non-empty file other than a file consisting of a single 0 byte.
2. It is easily reversible.
3. The output is never larger than the input.
4. The output is smaller than the input for some files.
5. If a file is larger than 1 byte, iterating enough times will always eventually produce a smaller file. What is "enough times" depends on the file content.
(I wrote this partly as a joke, and partly as a counterexample to some proofs I'd seen on Usenet about the limitations of compression programs where the proofs had not been sufficiently precise in specifying the conditions necessary for them to apply).
It is actually quite simple:
1. If the file is empty or a single byte that is 0, report that the input is not acceptable and exit.
2. If the file is N bytes, treat it as if it is an 8N bit unsigned integer, I.
3. If I is 0, the output file is N-1 bytes all with value 0xFF.
4. If I is not 0, subtract 1, and output the result.
(Decompression is left as an exercise for the reader).
Another way to look at this is to imagine that we have a sorted list of all possible files. The primary sort key is file size. The secondary sort key is the numerical value of the file when treated as an 8N bit unsigned integer, where N is the file size.
My compression algorithm is simply to replace each file with the file immediately ahead of it on that file list.
Viewed that way, it is pretty obvious that all my claims are true. It accepts any non-empty file other than the first file on the list (single 0 byte). To reverse it, replace the input with the file immediately after it from the list. The list is ordered by file size, so stepping down in the list never gives a larger file, and sometimes gives a smaller file. The output file is on the list, so you can obviously iterate.
The catch, of course, is that while yes, you can take your 1 TB of arbitrary data (even true random data!) and apply my algorithm iteratively to get down to, say, a 1 byte file, and yes, you can apply the decompression algorithm iteratively to recover your 1 TB from, to do that you have to know how many times to iterate on the decompression end.
The number of iterations required for an uncompressed file of N bytes to compress down to a byte is on the order of 2^(8N), so you will need about N bytes to store the iteration count.
So net result is that it can replace an N byte file with a 1 byte file...but it is going to drop N bytes of metadata on you that you'll need to store somewhere.
That's not necessarily true and probably is false. What we know about Pi is that it includes an infinite number of elements, but this says nothing about the structure of those elements. An infinitely long random sequence will probably hold every pattern possible (I'm not a mathematician so correct me if I'm wrong) but any structured or repeating sequence doesn't.
> An infinitely long random sequence will probably hold every pattern possible (I'm not a mathematician so correct me if I'm wrong) but any structured or repeating sequence doesn't.
This is not true. Pi is probably irrational, which means we know it doesn't repeat itself. However, it is not proven to be normal, so it's a conjecture (but not fact) that it contains every finite sequence of digits at some point.
Specifically, if pi contains itself, that means "pi - pi/b^k" is a terminating fraction N in base b", where b is your choice of base (like base 10), and k is some positive integer. So
pi = N/(1-b^k) which is rational.
(And if you set k=0, you get that pi "contains itself" in the sense that pi from the 1st digit onward is of course pi)
You’re probably thinking of set theory but the file system is about storing information in a sequence and an offset. Seen that way, π can be stored with offset 0.
If it's normal then it contains all finite strings uniformly in its base-b expansions. Pi is not a finite string so it wouldn't have to contain itself uniformly. More specifically if it contains itself then it's a rational number, but we know that pi is not a rational number. Note that we don't know whether pi is a normal real number.
Undortunately with this solution you’re still storing the same amount of information as you’d normally do. It’s just encoded as offsets into pi’s decimal expansion.
You can’t cheat entropy.
Whether that is true is related to the normalness [1] of the number pi, but this property has not been proven. See also: [2].
[1] https://en.wikipedia.org/wiki/Normal_number
[2] http://www.askamathematician.com/2009/11/since-pi-is-infinit...