Hacker News new | ask | show | jobs
by shimon 5000 days ago
Unlikely but not impossible. Note that the substring does NOT need to repeat in the uncompressed file; it just needs to be possible to say "copy bytes X through Y of /bin/ls here" in fewer than Y-X bytes. I'd love to know what the odds really are, but it seems like you have some good options for improving them (e.g. ask for a really huge input file).
1 comments

> it just needs to be possible to say "copy bytes X through Y of /bin/ls here" in fewer than Y-X bytes

It is very unlikely that the length of the sequence will be smaller than the space required to store X. Your plan is similar to saying "The input could contain long repeated segments containing only zeroes. Compress those using scheme Z. The longer the file, the more likely this is to occur.". Information theory allows us to prove that this cannot always work. In fact it is very unlikely to work for a specific randomly generated target file.

So by saying 'cannot _always_ work,' and 'very unlikely to work,' you are implicitly conceding the point- that it is possible. The question then becomes, given the size of the OS environment space (and the quality of the randomness in the originals), could one in 50 attempts actually succeed in shaving off a byte? I sincerely doubt that Mike, despite his superior knowledge of information theory, actually took the time to figure out just how good or bad the odds were. He assumed that it was fundamentally impossible, when it is in fact not impossible, just unlikely (to some unknown [to me] degree).
Let me just put it this way. It's more likely that you could win using gzip. It's similar to saying it's "possible" that the target file could have contained all zeroes.

Assuming the target file is competently constructed, the chances of winning wouldn't make it a rational bet for $0.01

I'm so sorry you had to suffer under these postings.