by krick
746 days ago
Are there any good posts about using entropy for tasks like this? I have been wondering for quite some time how people actually use it and whether it is effective at all, but I never got around to investigating the problem myself.

First of all, it is a bit unclear how to define "entropy" for text in the first place. Here it's as simple as `-Sum(x log(x))`, summed over the distinct characters, where x = countOccurences(char) / len(text). That raises a lot of questions about how well this actually works. How long does a string need to be for this to work? Is there a roughly constant entropy for natural languages? Is there a better approach?

I mean, it seems there must be: "obviously" "vorpal" must have lower "entropy" than "hJ6&:a". You and I both "know" that because 1) the latter "seems" to use a much larger character set than natural language; 2) even if it didn't, the ordering of characters matters, and the former just "sounds" like a real word, despite being made up by Carroll. Yet this "entropy" everybody seems to use has no idea about any of that. Both strings consist of six distinct characters, so both get exactly the same "entropy".

So, ok, maybe this works well enough for yet-another-github-password-searcher. But is there anything better? Is there a more meaningful metric of randomness for text? There are dozens of projects like this, all using "entropy" as if it were something obvious, but I've never seen any proper research on the subject.
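To make the "vorpal" vs "hJ6&:a" point concrete, here is a minimal Python sketch of the character-frequency entropy described above. The function name `char_entropy` and the choice of base-2 logarithms are my own assumptions for illustration, not taken from any particular project.

```python
import math
from collections import Counter


def char_entropy(text: str) -> float:
    """Shannon entropy in bits per character, using character frequencies only.

    Hypothetical helper: it ignores character order and character class,
    which is exactly the limitation discussed above.
    """
    if not text:
        return 0.0
    n = len(text)
    counts = Counter(text)  # occurrences of each distinct character
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


for s in ("vorpal", "hJ6&:a"):
    # Both strings have six distinct characters, each occurring once.
    print(s, round(char_entropy(s), 3))
```

Both lines print 2.585, i.e. log2(6). It also hints at the length question: with this measure a string of n characters can never score above log2(n) bits per character, so short strings are capped no matter how random they are.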