Hacker News new | ask | show | jobs
by haser_au 2885 days ago
The same has been said about Pi (3.14). If you can compute, store and search enough the digits of Pi, you can reference anything by just providing the 'start' and 'finish' locations. Unfortunately, with enough digits of Pi, the 'start' and 'finish' numbers can get quite long themselves.
4 comments

Years ago I tried this and basically ended up proving that if Pi is random and you are "compressing" random data, on average the start and finish numbers together are at least as long as the numbers you are trying to "compress."
Did you proove it formally or only by experiment. If it were a real proof, did you publish it somewhere?
In fact, any lossless compression algorithm has the property that the output is (on average) at least as long as the input. The best you can hope for is an algorithm that compresses the kind of data that humans want to store, at the expense of making other data a bit longer. If you're trying to compress random data then you just can't do it.

Here's a proof: consider the strings of length n or less, suppose there are M of them in total. Their average length is just the sum of all their lengths divided by M, and the average length of their compressed versions is just the total length of the compressed versions divided by M. Since the compression is lossless the compressed strings must all be different.

Since there are M strings, if any of them mapped to a string of length more than n then there must be some string of length at most n not being mapped to, so the average length can be improved by instead mapping that string to the shorter string. So any optimal compression method must map only to the strings of length at most n.

So the M outputs are just the M inputs, possibly permuted. So their total length is the same, and hence their average length is the same.

> any lossless compression algorithm has the property that the output is (on average) at least as long as the input.

The article you’ve linked says nothing about average. It says that for every algorithm there’s at least some input files that increase the size. It even explains more about that:

Any lossless compression algorithm that makes some files shorter must necessarily make some files longer, but it is not necessary that those files become very much longer. Most practical compression algorithms provide an "escape" facility that can turn off the normal coding for files that would become longer by being encoded. In theory, only a single additional bit is required to tell the decoder that the normal coding has been turned off for the entire input

Thanks. I realized this just after I posted it, so I wrote the proof into my comment instead.
>In fact, any lossless compression algorithm has the property that the output is (on average) at least as long as the input

I don't think this is true. If it was, lossless compression would be useless in a lot of applications. It's pretty easy to come up with a counter example.

E.g.

(simple huffman code off the top of my head, not optimal)

symbol -> code

"00" -> "0"

"01" -> "10"

"10" -> "110"

"11" -> "111"

If "00" will appear 99.999% of the time, and the other 3 symbols only appear 0.001% of the time, the output will "on average" be slightly more than half the length of the input.

Sure, I'm assuming that (a) you are trying to encode all strings of length at most n and (b) you have the uniform distribution over those strings. This makes sense in the original context of encoding random data.
>you have the uniform distribution over those strings. This makes sense in the original context of encoding random data.

Lossless compression is nothing more than taking advantage of prior knowledge of the distribution of the data you are compressing.

Random data isn't always (or even often) uniformly distributed. Everything we compress is "random" (in the context of information theory), so I disagree that it makes sense to assume uniformly distributed data.

i did somewhat the same thing. it introduced me to programming... i thought how about multiple start indices and fixed width? you can then compress the list of start indices in the same manner until you reach sufficient compression :D
http://www.angio.net/pi/

My 11 digit phone number occurs around the 115 millionth digit, a grand saving of two digits.

Wow, isn’t there a really low probability of finding your phone number in the first 200m digits of pi? (0.09995% in first 100m) I’m tempted to start throwing a dictionary of phone numbers through this pi lookup to find your number, call you, and verify, I think you could quickly narrow in on your phone number given the information above.
I think you'll have trouble. Assuming the first digit of the phone number is somewhere between 114.5 million and 115.5 million, you have 1 million potential 11 digit numbers to check.

There are 10^11 sequences with 11 digits. The number of people in the USA is 3×10^8, and we can assume there is roughly 1 number per person (some people don't have a phone number and some people have more than one, but it turns out that the exact approximation won't matter unless we're a few of orders of magnitude off). So about 0.3% of 11 digit sequences are valid phone numbers.

So there are approximately 0.3% × 1000000 = 3000 people with phone numbers around the 115 millionth digit of pi. You have no way of knowing which one of those people is sjcsjc.

It'd be a great 1 person-audience magic trick to call up all 3000 of those people and say, "Hi sjcsjc, I found your phone number in Pi".
Derren Brown did a great series of tricks based on probability.. your comment reminded me on the one that showed a video of him flipping 10 heads in a row on an unbiased coin. He revealed them that he had done it by flipping a coin for hours until he got that sequence.
Teller has made a similar statement, about practicing something more than anyone would find reasonable, so the trick is just in being able to do it.
Here's an implementation of precisely that, as a filesystem:

https://github.com/philipl/pifs

Discussed here previously (2014):

https://news.ycombinator.com/item?id=8018818

The indexes will, on average, have more digits than the sequence you compress. This is true for any infinite sequence of digits, not only Pi.
World's most painful compression algorithm: Finds a mathematical series of infinite digits, and the offset into it, that most efficiently compresses data passed in. Probably have to chunk the data up to make this efficient.

Of course one can argue that all current real life compression algorithms are aiming to simulate this, and that a brute force algorithm is one of those "after turning the sun into a CPU, still won't have enough compute power to finish the problem" types of solutions.