| HN Mirror

	by tgv 1325 days ago
	Looks very much web-based, and not cleaned properly. I conclude that because digits are pretty rare in a normal corpus, much rarer than x and y. The English list also has some punctuation included, and half of the Greek alphabet. The counting didn't exclude proper names and formulas, I suppose. So if you want to identify the domain of a Wikipedia page based on 1-grams, this is helpful; otherwise, less so.
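	For what it's worth, a rough sketch of the kind of cleaning I mean, in Python. The function name letter_counts and the sample string are made up for illustration, and this is not how the list in question was built. It only drops digits, punctuation, and letters from other scripts; excluding proper names or formulas would need tagging or markup-aware parsing, which this does not attempt.

```python
from collections import Counter
import unicodedata


def letter_counts(text, script="LATIN"):
    """Count letters of one script, skipping digits, punctuation, and other scripts.

    Hypothetical helper for illustration only.
    """
    counts = Counter()
    for ch in text.lower():
        # str.isalpha() is False for digits, punctuation, and whitespace.
        if not ch.isalpha():
            continue
        # Unicode character names begin with the script name, e.g.
        # "LATIN SMALL LETTER A" or "GREEK SMALL LETTER ALPHA".
        if unicodedata.name(ch, "").startswith(script):
            counts[ch] += 1
    return counts


# Made-up sample with the kinds of noise a web-scraped corpus contains.
sample = "In 2019, x = 42 and α + β = γ; see Figure 3."

# Uncleaned counts: every non-space character, digits and Greek included.
raw = Counter(ch for ch in sample.lower() if not ch.isspace())

# Cleaned counts: Latin letters only.
clean = letter_counts(sample)
```

	Comparing raw and clean on a real dump would show how much of the ranking comes from digits, symbols, and Greek letters rather than from ordinary text.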