by yoyo1999 4554 days ago
Can anybody help me understand how this data can be useful to anybody?

I was playing with n-grams for a while and even produced similar results. But I don't see how that data can be useful to anybody.

7 comments

I upvoted you to offset the downvote(s) you've received. There should be nothing wrong with a newbie (to HN at least) asking a question like this. Would anyone who downvoted OP care to explain the downvote? The only logical reason I can think of that someone here would've downvoted is something along the lines of - "He should've known that crypto and ML are the obvious answers."
Here's a fun example. Distribution of word length can give you insight into how deep a fast lexical lookup tree (trie) needs to be to capture most words. After that depth, you can fall back on a more memory efficient, but slower, structure like a hashtable.
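
To make the idea concrete, here is a minimal sketch of that kind of hybrid structure. It assumes one particular split: words no longer than the chosen depth go into a nested-dict trie, and anything longer goes into a plain Python set standing in for the hashtable. The class name `HybridLexicon`, the `max_depth` parameter and the `END` marker are all made up for illustration.

```python
END = ""  # end-of-word marker; cannot collide with a one-character child key


class HybridLexicon:
    """Depth-limited trie with a hash-set fallback for longer words."""

    def __init__(self, max_depth):
        self.max_depth = max_depth
        self.root = {}         # nested dicts: one level per character
        self.overflow = set()  # words too long for the trie

    def add(self, word):
        if len(word) > self.max_depth:
            self.overflow.add(word)
            return
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True

    def __contains__(self, word):
        if len(word) > self.max_depth:
            return word in self.overflow
        node = self.root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return False
        return END in node


lexicon = HybridLexicon(max_depth=5)
for w in ["tree", "trie", "lexicon", "hashtable"]:
    lexicon.add(w)
print("trie" in lexicon, "lexicon" in lexicon, "tre" in lexicon)
```

A real implementation would more likely use arrays or a packed layout for the trie levels; the dicts here only show where the depth cutoff sits.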

In this case: at a depth of 6, a trie can handle ~75% of all words. At 5 it can handle ~67%. Since tries can grow exponentially in memory (fully populated), dropping a level and still getting roughly a 70% solution might be good enough. It's about an 8 percentage point reduction in the share of the lexicon the trie can represent. However, if you go to depth 4, you can only cover ~56% of words, meaning there's a ~44% chance that a given word won't be stored in the trie.

Supposing we set a desired metric that the trie needs to handle roughly 70% of all words, then depth 5 is pretty reasonable and space efficient, with only about a 1/3 chance that a word in our lexicon won't be in the trie.
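
The coverage figures above are cumulative sums over the word-length distribution, so they are easy to recompute for any word list. A rough sketch follows; the function names `coverage_by_depth` and `min_depth_for` are hypothetical, and whether you count distinct words or every occurrence is your call (the numbers quoted above may use either).

```python
from collections import Counter


def coverage_by_depth(words):
    """Map each depth d to the fraction of words with length <= d."""
    lengths = Counter(len(w) for w in words)
    if not lengths:
        return {}
    total = sum(lengths.values())
    coverage = {}
    running = 0
    for depth in range(1, max(lengths) + 1):
        running += lengths.get(depth, 0)
        coverage[depth] = running / total
    return coverage


def min_depth_for(coverage, target):
    """Smallest depth whose coverage reaches the target, or None."""
    for depth in sorted(coverage):
        if coverage[depth] >= target:
            return depth
    return None


# Toy word list, only to show the calls; real input would be a full lexicon.
words = ["a", "to", "cat", "tree", "house", "lexicon"]
cov = coverage_by_depth(words)
print(cov)
print(min_depth_for(cov, 0.5))
```

With a real lexicon you would read the depth for a 70% target straight off `min_depth_for(cov, 0.7)` and compare it with the next level down.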

Do you really mean this? Two huge ones from a technical standpoint: Cryptography. Natural Language Processing; never mind the linguistic, word games, forensic, historical and literary uses. Also: fascinating!
OP site is unreachable at the moment. I can offer that letter frequency counts (ETNARIOUS?) are vital to basic (pen and paper) cryptanalysis and play a significant part in modern statistical and differential cryptanalysis.

Also, it can be used to guess what language a text might be in, for anthropology and archaeology. The frequency charts for a given language are reliable enough (in a sufficiently large sample, etc.) to guide that.

My first thought is cryptography, my second is language translation, my third is language recognition. Perhaps understanding word frequencies can also be used for linguistics, seeing how vocabularies change over time. There is a growing field of computational linguistics built around this.
When you are programmatically trying to unscramble something that has been encrypted, you end up trying a bunch of different keys. When you try a key and you want to know if the unscrambling worked, you can check the letter frequency distribution in your result against the letter frequency distribution of a big English corpus. If the distributions are close, you can be more confident that your key is correct.

It actually works pretty well. You end up having to do something like this pretty early on in the Matasano crypto challenge problems.
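
Here is a rough sketch of that check, using a single-byte XOR cipher as an assumed toy example. The letter table holds commonly cited approximate English frequencies; the `SPACE_SHARE` and `OTHER_SHARE` values are my own guesses, and the function names are made up. Each candidate is scored with a chi-squared distance against the expected distribution, and the key with the lowest score wins.

```python
# Approximate relative frequencies of letters in English text.
LETTER_FREQ = {
    "a": 0.0817, "b": 0.0149, "c": 0.0278, "d": 0.0425, "e": 0.1270,
    "f": 0.0223, "g": 0.0202, "h": 0.0609, "i": 0.0697, "j": 0.0015,
    "k": 0.0077, "l": 0.0403, "m": 0.0241, "n": 0.0675, "o": 0.0751,
    "p": 0.0193, "q": 0.00095, "r": 0.0599, "s": 0.0633, "t": 0.0906,
    "u": 0.0276, "v": 0.0098, "w": 0.0236, "x": 0.0015, "y": 0.0197,
    "z": 0.00074,
}
SPACE_SHARE = 0.18  # assumption: rough share of spaces in plain prose
OTHER_SHARE = 0.02  # assumption: punctuation, digits, everything else
LETTER_SHARE = 1.0 - SPACE_SHARE - OTHER_SHARE

EXPECTED = {ch: p * LETTER_SHARE for ch, p in LETTER_FREQ.items()}
EXPECTED[" "] = SPACE_SHARE
EXPECTED["other"] = OTHER_SHARE


def chi_squared(candidate: bytes) -> float:
    """Distance from the expected English distribution (lower is closer)."""
    if not candidate:
        return float("inf")
    counts = dict.fromkeys(EXPECTED, 0)
    for b in candidate:
        ch = chr(b).lower()
        counts[ch if ch in LETTER_FREQ or ch == " " else "other"] += 1
    n = len(candidate)
    return sum((counts[s] - p * n) ** 2 / (p * n) for s, p in EXPECTED.items())


def xor_single_byte(data: bytes, key: int) -> bytes:
    return bytes(b ^ key for b in data)


def crack_single_byte_xor(ciphertext: bytes):
    """Try all 256 keys and keep the one whose output looks most English."""
    key = min(range(256), key=lambda k: chi_squared(xor_single_byte(ciphertext, k)))
    return key, xor_single_byte(ciphertext, key)


secret = xor_single_byte(b"if the distributions are close you can be more "
                         b"confident that your key is correct", 0x5A)
print(crack_single_byte_xor(secret))
```

Counting spaces and an "other" bucket matters more than it looks: if you score letters alone and ignore case, flipping the 0x20 bit of the key produces a candidate with the same letter distribution, so the wrong key can tie with the right one.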

Utility is overrated. Personally I find this stuff fascinating.