English Letter Frequency Counts: Mayzner Revisited (2013) (norvig.com)
50 points by sindoc 4546 days ago
6 comments

Can anybody help me understand how this data can be useful to anybody?

I played with n-grams for a while and even produced similar results, but I don't see how this data can be useful to anybody.

I upvoted you to offset the downvote(s) you've received. There should be nothing wrong with a newbie (to HN at least) asking a question like this. Would anyone who downvoted OP care to explain the downvote? The only logical reason I can think of for a downvote is something along the lines of "He should've known that crypto and ML are the obvious answers."

Here's a fun example. Distribution of word length can give you insight into how deep a fast lexical lookup tree (trie) needs to be to capture most words. After that depth, you can fall back on a more memory efficient, but slower, structure like a hashtable.

In this case: at a depth of 6, a trie can handle ~75% of all words. At 5 it can handle ~67%. Since a fully populated trie grows exponentially in memory with depth, dropping a level and still getting about a 70% solution might be good enough. It costs about 8 percentage points of the representable lexicon. However, if you go to depth 4, you can only cover ~56% of words, meaning there's a ~44% chance that a given word won't be stored in the trie.

Supposing we set a target that the trie needs to handle roughly 70% of all words, then depth 5 is pretty reasonable and space efficient, with only a 1/3 chance that a word that is in our lexicon won't be in the trie.
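A minimal sketch of that depth calculation, in Python. The function names and the toy counts are made up for illustration; they are not Norvig's figures, so plug in the real word-length table to reproduce the percentages above.

```python
from itertools import accumulate


def coverage_by_depth(length_counts):
    """Map each word length d to the share of word occurrences of length <= d."""
    total = sum(length_counts.values())
    lengths = sorted(length_counts)
    running = accumulate(length_counts[n] for n in lengths)
    return {n: seen / total for n, seen in zip(lengths, running)}


def min_depth_for(length_counts, target):
    """Smallest trie depth whose coverage reaches `target` (a fraction in [0, 1])."""
    for depth, share in coverage_by_depth(length_counts).items():
        if share >= target:
            return depth
    return max(length_counts)


# Hypothetical counts (word length -> occurrences), for illustration only.
toy_counts = {1: 5, 2: 15, 3: 20, 4: 15, 5: 10, 6: 10, 7: 10, 8: 15}

coverage = coverage_by_depth(toy_counts)
depth = min_depth_for(toy_counts, 0.70)
print(coverage)
print("depth needed for 70% coverage:", depth)
```

Words longer than the chosen depth would go to the fallback structure (the hashtable mentioned above).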

Do you really mean this? Two huge ones from a technical standpoint: cryptography and natural language processing, never mind the linguistic, word-game, forensic, historical, and literary uses. Also: fascinating!

OP site is unreachable at the moment. I can offer that letter frequency counts (ETNARIOUS?) are vital to basic (pen and paper) cryptanalysis and play a significant part in modern statistical and differential cryptanalysis.

Also, it is used in anthropology and archaeology to guess what language a text might be in. The frequency charts for a given language are reliable enough (in a sufficiently large sample, etc.) to guide that.

My first thought is cryptography, my second is language translation, my third is language recognition. Perhaps understanding word frequencies can also be used for linguistics, seeing how vocabularies change over time. There is a growing field of computational linguistics built around this.

When you are programmatically trying to unscramble something that has been encrypted, you end up trying a bunch of different keys. When you try a key and you want to know if the unscrambling worked, you can check the letter frequency distribution in your result against the letter frequency distribution of a big English corpus. If the distributions are close, you can be more confident that your key is correct.

It actually works pretty well. You end up having to do something like this pretty early on in the Matasano crypto challenge problems.
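A rough sketch of that key-scoring idea, using a Caesar shift as a stand-in cipher (the crypto challenges mentioned above use other ciphers; this is only the simplest case). The frequency table holds approximate, commonly cited English percentages, not the counts from the linked article, and the chi-squared score is one reasonable choice of distance, not the only one.

```python
from collections import Counter
import string

# Approximate English letter frequencies in percent (assumed values).
ENGLISH_FREQ = {
    "e": 12.70, "t": 9.06, "a": 8.17, "o": 7.51, "i": 6.97, "n": 6.75,
    "s": 6.33, "h": 6.09, "r": 5.99, "d": 4.25, "l": 4.03, "c": 2.78,
    "u": 2.76, "m": 2.41, "w": 2.36, "f": 2.23, "g": 2.02, "y": 1.97,
    "p": 1.93, "b": 1.49, "v": 0.98, "k": 0.77, "j": 0.15, "x": 0.15,
    "q": 0.095, "z": 0.074,
}


def shift_text(text, shift):
    """Caesar-shift lowercase letters by `shift`; leave everything else alone."""
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("a")))
        else:
            out.append(ch)
    return "".join(out)


def chi_squared(text):
    """Lower score means the letter distribution looks more like English."""
    letters = [ch for ch in text.lower() if ch in string.ascii_lowercase]
    if not letters:
        return float("inf")
    counts = Counter(letters)
    n = len(letters)
    score = 0.0
    for ch, pct in ENGLISH_FREQ.items():
        expected = n * pct / 100
        score += (counts.get(ch, 0) - expected) ** 2 / expected
    return score


def crack_caesar(ciphertext):
    """Try all 26 keys and keep the one whose output scores closest to English."""
    best = min(range(26), key=lambda k: chi_squared(shift_text(ciphertext, -k)))
    return best, shift_text(ciphertext, -best)


plaintext = ("letter frequency analysis is a simple and effective way to break "
             "a substitution cipher when the message is long enough")
ciphertext = shift_text(plaintext, 7)
key, recovered = crack_caesar(ciphertext)
print(key, recovered)
```

Short messages give noisy letter counts, so the score gets less trustworthy as the ciphertext shrinks.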

Utility is overrated. Personally I find this stuff fascinating.

I used Norvig's frequency counts as input for the board generation algorithms (in Scala) for my Android word game "5 Star Words" [1]. With this as the start plus a few other tricks, I'm typically able to reach an average of ~300 common English words (or easily 400+ when including less common and swear words) on a 4x4 letter board.

[1] https://play.google.com/store/apps/details?id=com.starwords
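For anyone curious what frequency-weighted board generation could look like, here is a minimal Python sketch. It is not the game's actual Scala algorithm and leaves out the "other tricks"; `make_board` and the rounded toy weights are hypothetical stand-ins for a real frequency table.

```python
import random


def make_board(weights, size=4, rng=None):
    """Fill a size x size board, drawing each letter in proportion to its weight."""
    rng = rng or random.Random()
    letters = list(weights)
    letter_weights = [weights[c] for c in letters]
    return [rng.choices(letters, weights=letter_weights, k=size)
            for _ in range(size)]


# Rounded, illustrative weights for a handful of letters (assumed values).
TOY_WEIGHTS = {"E": 12, "T": 9, "A": 8, "O": 8, "I": 7, "N": 7,
               "S": 6, "R": 6, "H": 5, "L": 4, "D": 4}

board = make_board(TOY_WEIGHTS, rng=random.Random(0))
for row in board:
    print(" ".join(row))
```

Drawing with replacement like this can produce vowel-starved boards, which is presumably where extra tricks come in.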

I think natural language designers might also look at the letter frequencies and question why 'E' shows up so much. Is the canonical sound it makes just common in English or is there some problem with its "design"? It turns out E is way overloaded in English:

- it's silent in the case of modifying preceding vowels separated by a medial consonant e.g. hat vs. hate, bat vs. bate

- superfluous as a final letter in older English (or English that wants to feel old) e.g. olde, pubbe

- as a silent letter entirely e.g. eagle

- as itself e.g. egg, education

- as a silent or nearly silent suffix separator for -ed e.g. dropped, judged

- as a non-silent suffix for -ed e.g. educated

- silent as an immediate vowel modifier in vowel digraphs (in some spellings) e.g. archaeology, encyclopaedia, caesar. It was so incidental that it used to be written as a ligature.

- silent as a modifier on itself e.g. teen, feel

- one of several representations for schwa, ə e.g. taken (takən), enemy (enəmy)

etc.

'e' is a mess. It's mostly silent, either ignored completely or modifying something else (an issue even Benjamin Franklin tried to solve through a proposed spelling reform). It's also conflated with schwa, which is the most common vowel sound in English yet has no single representation.

A language reformer would probably tackle this letter first and fix a great deal of the spelling problems in English.

"Natural language designer" is a contradiction; one of the core defining properties of natural languages (like English) is that they are not designed.

You switched to "reformer" in your closing sentence; perhaps that was what you originally meant, too?

Of course, such a reform is not exactly easy to implement.

I mean natural language as "language for humans to use to communicate with each other" as opposed to programming language as "language for humans to use to communicate with computers". It's the same meaning as is used in NLP.

e.g. https://en.wikipedia.org/wiki/Hangul https://en.wikipedia.org/wiki/Cyrillic and I guess even https://en.wikipedia.org/wiki/Klingon_language

This is different than the meaning of https://en.wikipedia.org/wiki/Natural_language and https://en.wikipedia.org/wiki/Constructed_language

I guess if you want to get pedantic a better term might be "Orthographic design".

Fun fact: If you're taking one vowel and five consonants, the Wheel of Fortune letters RSTLNE (not in that order) are the letters that are most likely to occur.

I love this. One minor presentation issue: for the "Letter Counts by Position Within Word" section, that charting approach is less than helpful. Improvements within the structure he uses might be coding each letter with its own colour, and having each letter in its own column, reordered by length. However, charting experts could probably come up with a more useful approach than I can off the top of my head.