Hacker News
by tgv 1325 days ago
Looks very much web-based, and not cleaned properly. I conclude that because digits are pretty rare in a normal corpus, much rarer than x and y, which is not what this list shows. The English list also includes some punctuation and half of the Greek alphabet. I suppose the counting didn't exclude proper names and formulas. So if you want to identify the domain of a Wikipedia page based on 1-grams, this is helpful; otherwise, less so.
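To make the cleaning point concrete, here is a rough sketch of a 1-gram count that drops the kinds of tokens mentioned above: digits, stray punctuation, formula-style single letters like x and y, and non-ASCII symbols such as Greek letters. The function name `count_unigrams`, the `keep_single` parameter, and the filtering rules are all my own assumptions for illustration, not how the list was actually built, and it makes no attempt to exclude proper names.

```python
import re
from collections import Counter

# Letters only, with an optional internal apostrophe (e.g. "cat's").
# Hypothetical rule for illustration; not the original list's pipeline.
WORD = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)?")

def count_unigrams(text, keep_single=("a", "i")):
    """Count word 1-grams, skipping digits, punctuation and formula debris."""
    counts = Counter()
    for raw in text.split():
        # Trim punctuation attached to the ends of the token.
        token = raw.strip(".,;:!?\"'()[]{}").lower()
        # Drop anything that is not purely a word (numbers, "=", "3.14", ...).
        if not WORD.fullmatch(token):
            continue
        # Assumption: for an English list, non-ASCII tokens (e.g. Greek) are noise.
        if not token.isascii():
            continue
        # Single letters other than "a" and "i" are most likely variables.
        if len(token) == 1 and token not in keep_single:
            continue
        counts[token] += 1
    return counts

sample = "The value of x is 42, and α = 3.14; I saw a cat. The cat's hat!"
print(count_unigrams(sample).most_common(3))
```

A filter like this is deliberately crude. Dropping proper names would need something more, such as a name list or a part-of-speech tagger, which is probably why web-scraped counts tend to skip it.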