| HN Mirror



	by adzm 51 days ago
	It is worth noting that as the length of data increases it becomes extremely unlikely that the index and length of the sequence within pi would actually be smaller than the data.

8 comments

Aloisius 51 days ago

That seems easy enough to solve. Simply record the index and length in pi of the index and length in pi.


awesome_dude 51 days ago

See also: Recursion


jastr 51 days ago

Back in college, I thought I could compress my phone number by telling people its index in pi, but my 7 digit phone number is at an 8 digit index.

I didn’t have the compute to find my 10 digit number with the area code.
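
For anyone who wants to try the phone-number experiment on a small scale, here is a minimal sketch. It assumes a Machin-formula digit generator (`pi_digits`, a hypothetical helper written for this example) and a plain substring search, so it can only find strings that occur within the digits it has actually computed.

```python
def pi_digits(n, guard=20):
    """Return the first n decimal digits of pi after the decimal point.

    Uses Machin's formula pi = 16*arctan(1/5) - 4*arctan(1/239) in integer
    arithmetic. The `guard` extra digits absorb truncation error; that they
    are enough is an assumption that holds comfortably for small n.
    """
    scale = 10 ** (n + guard)

    def arctan_inv(x):
        # arctan(1/x) * scale, via the alternating Taylor series
        term = scale // x
        total = term
        x2 = x * x
        k, sign = 3, -1
        while term:
            term //= x2
            total += sign * (term // k)
            k += 2
            sign = -sign
        return total

    pi_scaled = 4 * (4 * arctan_inv(5) - arctan_inv(239))
    return str(pi_scaled)[1:n + 1]  # drop the leading "3"


def index_in_pi(needle, digits):
    """0-based index of `needle` within `digits`, or None if it is absent."""
    pos = digits.find(needle)
    return None if pos < 0 else pos


digits = pi_digits(1000)
for needle in ("592", "14159"):
    print(needle, "->", index_in_pi(needle, digits))  # 592 -> 3, 14159 -> 0
```

A 7-digit number needs far more digits than this computes, and a naive `str.find` over them, which is roughly where the compute problem starts.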


xavortm 51 days ago

HEX should've solved for char length?


mondrian 51 days ago

The index of your 20 line file is <20TB number>


russfink 51 days ago

Unless, in turn, you locate the index itself in pi at a much smaller index. And so on...

Find k candidate indices for your data, then locate each of them. If the smallest one is a significantly smaller index space, repeat.


akoboldfrying 51 days ago

Can't tell if you're in on the joke or not, but for anyone who is genuinely wondering whether this might work: Consider that there are at most 256 different indexes that could be represented by a 1-byte index value, but if you're trying to store 9 bits of data, there are already 512 different possible things it could be that each need to be represented by a different index value, otherwise you won't be able to tell them apart. Those pigeons aren't gonna fit.
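
The counting argument can be checked in a few lines. This is only a sketch of the arithmetic; the function names are made up for the example. It also counts every bit string shorter than n bits, which shows that letting the index vary in length does not create extra room.

```python
def distinct_values(bits):
    """Number of different values a field of exactly `bits` bits can take."""
    return 2 ** bits


def shorter_strings(bits):
    """Number of bit strings of any length from 0 up to bits - 1."""
    return sum(2 ** k for k in range(bits))  # equals 2**bits - 1


index_bits = 8   # a 1-byte index
data_bits = 9    # the data we hope to store

indexes = distinct_values(index_bits)    # 256
payloads = distinct_values(data_bits)    # 512

# Any mapping from payloads to indexes sends at least this many payloads
# to the same index (ceiling division), so decoding would be ambiguous.
min_collisions = -(-payloads // indexes)
print(indexes, payloads, min_collisions)  # 256 512 2

# Even variable-length indexes fall short: 511 strings for 512 payloads.
print(shorter_strings(data_bits))  # 511
```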


jonhohle 51 days ago

That’s what variable length encoding is for!


Galanwe 51 days ago

It's recursive as well, you now need to store how many levels of indirection of indices you had to resolve, which will in turn take 20TB to store, unless you store that in pi as well, which in turn...


12_throw_away 51 days ago

yes I believe that's the joke


jwpapi 51 days ago

He’s aware, he just added some curious information.


hatthew 51 days ago

TFA addresses this

> Now, we all know that it can take a while to find a long sequence of digits in π, so for practical reasons, we should break the files up into smaller chunks that can be more readily found.

> In this implementation, to maximise performance, we consider each individual byte of the file separately, and look it up in π.
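
A rough sketch of the per-byte idea follows. It is not the article's implementation, which is not shown in this thread. As assumptions for illustration, it looks each byte up as a zero-padded 3-digit decimal string in the decimal digits of pi, and it uses a hypothetical Machin-formula helper `pi_digits` to produce those digits.

```python
def pi_digits(n, guard=20):
    """First n decimal digits of pi after the decimal point (Machin's formula).

    The `guard` extra digits absorb truncation error; treating them as
    sufficient is an assumption of this sketch.
    """
    scale = 10 ** (n + guard)

    def arctan_inv(x):
        # arctan(1/x) * scale, via the alternating Taylor series
        term = scale // x
        total = term
        x2 = x * x
        k, sign = 3, -1
        while term:
            term //= x2
            total += sign * (term // k)
            k += 2
            sign = -sign
        return total

    return str(4 * (4 * arctan_inv(5) - arctan_inv(239)))[1:n + 1]


DIGITS = pi_digits(20000)


def encode(data):
    """Replace each byte with the index of its 3-digit decimal form in pi."""
    indexes = []
    for b in data:
        pos = DIGITS.find(f"{b:03d}")
        if pos < 0:
            raise ValueError(f"byte {b} not found in the digits computed")
        indexes.append(pos)
    return indexes


def decode(indexes):
    """Read three digits back at each index to recover the bytes."""
    return bytes(int(DIGITS[i:i + 3]) for i in indexes)


encoded = encode(bytes([141, 159]))
print(encoded)                 # [0, 2]
print(list(decode(encoded)))   # [141, 159]
```

The two bytes above were picked because they sit at the very start of pi. For most byte values the first match should be expected well past index 255, so each "compressed" byte needs more than one byte to store.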


ithkuil 51 days ago

Why stop at bytes? Let's split it in individual bits and then look up the bits in pi!

But Pi's binary expansion is not very practical for this purpose, since it's 11.0010...

OTOH. e is 10.1011...

Let's stick to fractional digits (the ones right of the binary point) at index 0 we have 1 and at index 1 we have 0.

So, to encode a stream of bytes so that each bit is encoded as the index of that bit in e, all you need to do is XOR each byte with 0xFF.
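
A small sketch of that scheme, for checking the claim. The helper names are made up for the example, and the fractional bits of e are taken from a double-precision `math.e`, which is plenty for the first two bits.

```python
import math


def e_fraction_bits(count):
    """First `count` binary digits of e to the right of the binary point."""
    frac = math.e - 2  # e = 10.1011... in binary, so the integer part is 2
    bits = []
    for _ in range(count):
        frac *= 2
        bit = int(frac)
        bits.append(bit)
        frac -= bit
    return bits


E_BITS = e_fraction_bits(2)                               # [1, 0]
INDEX_OF_BIT = {bit: i for i, bit in enumerate(E_BITS)}   # {1: 0, 0: 1}


def encode_bitwise(data):
    """Replace every bit with the index of that bit value in e's fraction."""
    out = bytearray()
    for b in data:
        packed = 0
        for shift in range(7, -1, -1):
            bit = (b >> shift) & 1
            packed = (packed << 1) | INDEX_OF_BIT[bit]
        out.append(packed)
    return bytes(out)


def encode(data):
    """The shortcut: flipping every bit is the same as XOR with 0xFF."""
    return bytes(b ^ 0xFF for b in data)


decode = encode  # XOR with 0xFF is its own inverse

print(encode(b"pi") == encode_bitwise(b"pi"))  # True
print(decode(encode(b"pi")))                   # b'pi'
```

The output is exactly as long as the input, so the compression ratio is a solid 1:1.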


nvader 51 days ago

Hang on hang on let me write a CUDA kernel for this. This is going to be really huge.

genius

Point taken about the index potentially being really long. Why would the length be longer than the data? Don’t you need to find the right sequence?


gowld 51 days ago

For a given length of data, considering all possible data of that length, it's impossible for the median index length to be shorter than the data length. There aren't enough positions that early in pi to cover all the strings of that length.
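
The bound behind this can be worked out directly. The sketch below assumes decimal digits and counts only the index, ignoring the extra cost of storing a length; the function name is made up for the example. Each starting position in pi pins down exactly one n-digit string, so the number of short indexes caps the number of strings that can have one.

```python
from fractions import Fraction


def max_share_with_shorter_index(n, base=10):
    """Upper bound on the share of n-digit strings whose index has fewer than n digits."""
    strings = base ** n              # all n-digit strings, leading zeros allowed
    short_indexes = base ** (n - 1)  # indexes 0 .. base**(n-1) - 1
    return Fraction(short_indexes, strings)


for n in (3, 7, 10):
    print(n, max_share_with_shorter_index(n))  # always 1/10
```

At most one string in ten can get an index shorter than itself, so the median string of any length cannot.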


jerf 51 days ago

I wonder if it might make more sense to come at it from the opposite angle. Take pi as a sequence you want to compress with. But pi, being random, has redundancies in it that make it less than optimal. So instead, for a given size of block you want to look up, design the optimal number to use for compression. For instance, if you want to compress "594" in the digits of pi, the sequence 253 appears before it twice, which means any attempt to "compress" any three-digit sequence that only first appears after the second 253 is costing you more to get past the second 253, and "pi, but with all the 253s removed after the first one" is clearly a more efficient encoder for 3-digit numbers than pi itself.

So, instead of using pi, design an optimal number to encode with.

What you'll find is that the optimal sequence ends up being exactly as efficient as listing the blocks in order and indexing by block number. There are a number of other solutions; you could use superpermutations to get "all possible subsequences" with fewer digits in your target number, but then you'll need to give the encoder and decoder a table of where the digit sequences appear, since they are no longer regular. Indexing into that table costs exactly the same as writing your number as the concatenation of all the blocks and indexing by block rather than by digit position.

This actually has some natural overlap with the "normal numbers" in that one of the earlier normal numbers was: https://en.wikipedia.org/wiki/Champernowne_constant I'm not sure whether this is necessarily optimal for an arbitrary block size. (My quick intuitive check suggests it may be, but "my quick intuitive check" in the time of an HN post is not something I'd count on.) In this scheme, you can include the fact that the person using this constant to encode knows the nature of the constant, so they know that if you give index 0-9, it's single digit, and if you index into the two-length blocks, it must have a length of two. Since the encoder and decoder know that, they can also skip the middle of the block and just index into "the n'th number"... which degenerates into "the index of number N is N", which means this is not a compression scheme.
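
The degenerate end point can be made concrete with a sketch. It assumes fixed-width blocks listed in numeric order, which is a simplification of the Champernowne constant rather than the constant itself, and the function names are made up for the example.

```python
def block_sequence(width):
    """Concatenate every `width`-digit block in numeric order: 000001002...999."""
    return "".join(f"{i:0{width}d}" for i in range(10 ** width))


def block_index(block, sequence):
    """Find `block` by scanning block-aligned positions; return its block number."""
    width = len(block)
    for n in range(len(sequence) // width):
        if sequence[n * width:(n + 1) * width] == block:
            return n
    raise ValueError(f"{block!r} not found")


seq = block_sequence(3)
print(len(seq))                  # 3000
print(block_index("594", seq))   # 594
```

The index of "594" is 594: the lookup returns the data it was given, and nothing has been compressed.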

To put all that in a nutshell: if you want to deeply understand why this compression scheme doesn't work, I think the best way to get there is to try optimizing it.


account42 51 days ago

That just means you'll be creating even more valuable metadata to store your files. Win-win.


bandrami 51 days ago

At least as of 15 years ago when I was in grad school that remained an open conjecture.
