| I think it's doable in an undergrad ML or NLP class. In fact, we ran a course for high school students in which they learnt how a language guesser works and had to modify one. A simple method that already works very well is: * Create an n-gram fingerprint for each language by listing the character uni-, bi-, and trigrams of a training text in order of frequency. Retain the (say) 300 most frequent n-grams. * To categorize a text, create a fingerprint for that text in the same way. Then compute, for each language, the sum of the n-gram rank differences between the text's fingerprint and the language's fingerprint. If an n-gram does not occur in the language's fingerprint, the difference is the fingerprint size. Finally, pick the language with the lowest sum. Of course, you can do fancier things, such as training an SVM or logistic regression classifier with n-grams and words as features. An interesting variation is distinguishing different languages within a single text, e.g. a Dutch text with English quotes. |
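For what it's worth, the fingerprint method described above can be sketched in a few lines of Python. This is only a minimal sketch under assumptions: the function names (`fingerprint`, `distance`, `guess_language`), the lowercasing, the alphabetical tie-breaking, and the two toy training sentences are illustrative choices, not taken from the course. A real guesser would build its fingerprints from much larger texts.

```python
from collections import Counter


def fingerprint(text, size=300, max_n=3):
    """Map each of the `size` most frequent character n-grams (n = 1..max_n) to its rank."""
    text = text.lower()  # assumption: case is ignored
    counts = Counter(
        text[i:i + n]
        for n in range(1, max_n + 1)
        for i in range(len(text) - n + 1)
    )
    # Most frequent first; ties are broken alphabetically so the result is deterministic.
    ranked = sorted(counts, key=lambda gram: (-counts[gram], gram))[:size]
    return {gram: rank for rank, gram in enumerate(ranked)}


def distance(doc_fp, lang_fp, size=300):
    """Sum of rank differences; an n-gram missing from the language fingerprint costs `size`."""
    return sum(
        abs(rank - lang_fp[gram]) if gram in lang_fp else size
        for gram, rank in doc_fp.items()
    )


def guess_language(text, profiles, size=300):
    """Return the language whose fingerprint has the lowest distance to the text's fingerprint."""
    doc_fp = fingerprint(text, size)
    return min(profiles, key=lambda lang: distance(doc_fp, profiles[lang], size))


# Toy training data (hypothetical, far too small for real use).
samples = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "nl": "de snelle bruine vos springt over de luie hond en de kat zat op de mat",
}
profiles = {lang: fingerprint(text) for lang, text in samples.items()}

print(guess_language("the dog and the fox", profiles))
```

Storing ranks in a dict keeps the distance computation to one lookup per n-gram. The penalty for a missing n-gram is the fingerprint size, as in the description above.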
Do you know of any interesting work on distinguishing different languages within the same text?