BTW, here is my implementation of this idea: https://github.com/andreasvc/disco-dop/blob/master/web/parse...

I haven't tested it on more than 3 languages, so it might perform badly, but my intuition is that it is easier to get good coverage of each language's vocabulary than to get the frequencies of something like the top character n-grams right. The latter are affected by the authorship and genre of the text &c.
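
To make the vocabulary-coverage idea concrete, here is a minimal sketch. It is not the code from the linked repo; the tiny word lists, the `VOCAB` table, and the `guess_language` function are all made up for illustration, and a real version would load a large word list per language.

```python
import re

# Hypothetical toy word lists (assumption): a real system would use
# a large vocabulary per language, not eight function words.
VOCAB = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "nl": {"de", "het", "en", "een", "van", "is", "dat", "niet"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "von"},
}


def guess_language(text, vocab=VOCAB):
    """Return the language whose vocabulary covers the most tokens.

    Returns None if the text has no tokens or no token is known
    in any language.
    """
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return None

    def coverage(lang):
        # Fraction of tokens found in this language's word list.
        return sum(tok in vocab[lang] for tok in tokens) / len(tokens)

    best = max(vocab, key=coverage)
    return best if coverage(best) > 0 else None


if __name__ == "__main__":
    print(guess_language("de kat is niet van het huis"))
```

Note that only set membership matters here, not how often each word occurs, which is why this should be less sensitive to genre than n-gram frequency profiles. Ties are broken by dictionary order in this sketch, which a real implementation would probably want to handle more carefully.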