by kanjus 2694 days ago
Great write-up!

> It is actually a very reasonable assumption for almost all kinds of data -- given that suitable compression is applied. Data, which is well-compressed, is essentially uniformly random.

What kinds of data are an exception? Your explanation seems to cover everything.

1 comment

First of all, we have protocol data such as headers and framing, which we might never properly get rid of. People might also send uncompressed data. These are all practical concerns, but for analysis, assuming independent (and even uniform) data is not wrong, just rough.
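
To make the header point concrete, here is a rough sketch using Python's standard gzip module. The random payloads are just a stand-in I made up for "already well-compressed" data; the point is only that every stream begins with the same fixed bytes, which are anything but uniformly random.

```python
import gzip
import os

# Hypothetical payloads: random bytes stand in for already-compressed data.
payloads = [os.urandom(64) for _ in range(3)]
streams = [gzip.compress(p) for p in payloads]

# Each stream starts with the gzip magic number (1f 8b) and the
# compression-method byte (08 = DEFLATE), whatever the payload is.
for s in streams:
    print(s[:3].hex(), len(s))
```

The framing also makes each stream longer than its 64-byte payload, so this overhead never compresses away.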

Second, you might (will) not be able to completely compress the data. A picture might be worth a thousand words, but it still takes up a megabyte or so on disk. That makes for about 1000 bytes per word ;) So the entropy/information of a picture might be very small ("A dog jumping into water"), but we have no chance of truly understanding a general source (reality) and expressing its full machinery.

Think about the difference between JPEG and PNG (or GZIP and a JavaScript minifier). They are designed around completely different assumptions about the source and even the receiver. JPEG assumes that the most important part of an image is what a human viewer perceives; PNG is lossless, but assumes high inter-pixel dependence. GZIP assumes general bytes (I think); JS minification assumes that there is a more fundamental representation of the source without noise (formatting, comments, reasonable names, dead functions).
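
If you want to check the "well-compressed data looks uniform" claim yourself, a minimal sketch is to compare the empirical byte entropy before and after a general-purpose compressor. The word list, the seed and the `byte_entropy` helper below are my own illustrative choices, and zlib (DEFLATE) stands in for GZIP.

```python
import math
import random
import zlib
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Empirical entropy in bits per byte (8.0 means perfectly uniform bytes)."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Hypothetical source: words drawn at random from a tiny vocabulary.
words = ["dog", "jumps", "into", "water", "picture",
         "thousand", "words", "header", "frame", "pixel"]
rng = random.Random(0)
text = " ".join(rng.choice(words) for _ in range(20000)).encode()

packed = zlib.compress(text, 9)

print(f"raw:    {len(text)} bytes, {byte_entropy(text):.2f} bits/byte")
print(f"packed: {len(packed)} bytes, {byte_entropy(packed):.2f} bits/byte")
```

The packed bytes should sit much closer to 8 bits per byte than the raw text does. Note that this only measures single-byte frequencies, so it says nothing about dependence between bytes; it is a sanity check, not a proof of uniformity.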

Cheers!