Hacker News new | ask | show | jobs
by raverbashing 4120 days ago
Some simple explanation

Compression exploits redundancy in a data stream (basically). You basically get "all symbols" (and how you define this varies according to your compression method: you could do all letters in the case of text, or even text snippets that repeat, etc) and reassemble them in a way that the ones that repeat the most take less space (and you also need to start from a basic dictionary known by all uncompressors or ship it with your compressed file)

One simple analogy is writing with abbreviations, but if you write e.g. the reader has to know what "e.g." means or you have to put in the beginning "e.g. = example" (and this also takes space)

Now, a randomly generated file ideally has all symbols repeating with the same frequency, (we say all symbols have the same entropy - I'm not sure about this exact wording), hence you can't take a symbol that repeats more or less and make it take less space in your compressed file

1 comments

what if instead of using your own dictionary, you use an index into an existing dictionary? such as an index into a subsequence of pi. Couldn't you then find a sequence of bytes in the file in which the index into pi takes less bytes and then replace them all with the index? If you couldn't find any in pi use e or another such number? What am I missing
In this case your dictionary either doesn't have everything or to adequately point to it you take as much space as not using it.

While Pi has all pairs of 2 digits, your index would take more space than storing the pairs itself (because you might need to go beyond position 99)

For one situation you might "get lucky" and find a coincidence, but this won't scale generically

See also: https://github.com/philipl/pifs and https://github.com/philipl/pifs/issues/23

I admit this fooled me for a bit. Good news is, I won't be fooled again by something similar :)