by hackinthebochs
749 days ago
Entropy is a measure of the complexity or disorder of a signal. The interesting part is that the disorder is relative to a basis or dictionary: something can look complex in one encoding but be low entropy in the right one. You need to know the right basis, or infer it from context, to accurately determine the entropy of a signal. A much stronger way to build a tool like the OP's is to keep a few pre-computed dictionaries for a range of typical source texts (source code, natural language), encode the string against each dictionary, and compare how well it compresses. A high-entropy string like a secret will compress poorly against all available dictionaries.
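To make the dictionary idea concrete, here is a minimal sketch using the preset-dictionary support in Python's `zlib`. The two dictionaries are tiny hypothetical stand-ins I made up for illustration (a real tool would build much larger ones from representative corpora), and `best_ratio` is just a name I picked, not anything from the OP's tool.

```python
import os
import zlib

# Hypothetical, hand-written stand-ins for pre-computed dictionaries.
# A real tool would derive these from large samples of each kind of text.
DICTIONARIES = {
    "source_code": b"def return import class self for in if else while None True False",
    "natural_language": b" the and of to in that is was for with as it on be at by this have from ",
}

def compressed_size(data: bytes, zdict: bytes) -> int:
    """Size in bytes of `data` deflated with `zdict` as a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    return len(comp.compress(data) + comp.flush())

def best_ratio(data: bytes) -> float:
    """Smallest compressed/original size ratio across all dictionaries."""
    return min(compressed_size(data, d) / len(data) for d in DICTIONARIES.values())

if __name__ == "__main__":
    prose = b"this is the house that was built by the river and it is of stone"
    secret = os.urandom(len(prose))  # stand-in for a high-entropy secret
    print("prose :", round(best_ratio(prose), 2))
    print("secret:", round(best_ratio(secret), 2))
```

A string that stays near or above a ratio of 1.0 against every dictionary is a candidate secret. Where to put the cutoff is an empirical question this sketch doesn't answer.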
Anyway, that doesn't really answer my question. To summarize the answers in this thread, I think PhilipRoman has captured the essence of it: strictly speaking, the idea of the entropy of a known string is nonsense. So, as I suspected, the information-theoretic definition isn't meaningfully applicable to the problem. And as other commenters like you mentioned, what we are really trying to measure is basically Kolmogorov complexity, which is, strictly speaking, uncomputable, but the compression rate under some well-known, popular compression algorithm (allegedly) seems to be a good enough estimate, empirically.
But I think it's still an interesting linguistic question. Meaningful or not, it's well defined, so does it appear to work? Are there known constants for different kinds of text for any of these (or other) metrics? I would suspect this has been explored already, but neither I nor anybody else in this thread has apparently ever stumbled upon such an article.
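For anyone who wants to poke at the two metrics side by side, here is a rough sketch. It assumes "entropy of a string" means the Shannon entropy of its empirical byte-frequency distribution, and it uses zlib as the stand-in "well-known compression algorithm". Both function names are mine, for illustration only.

```python
import math
import zlib
from collections import Counter

def char_entropy_bits(data: bytes) -> float:
    """Shannon entropy of the empirical byte distribution, in bits per byte."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size, using zlib at level 9."""
    return len(zlib.compress(data, 9)) / len(data)

if __name__ == "__main__":
    sample = b"ab" * 500
    # Byte frequencies alone say 1 bit per byte, but the compressor also
    # exploits the ordering, so the ratio ends up far below 1/8.
    print(char_entropy_bits(sample), compression_ratio(sample))
```

The frequency-based number ignores ordering entirely, which is one reason the compression ratio is the closer proxy for what we actually care about. Running both over corpora of different kinds of text would be one way to look for the "known constants" I'm asking about.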