by beering
4596 days ago
Alternatively, people can just download langid.py [1] and do language detection locally. This is not a particularly hard problem; I think it's doable in an undergrad ML or NLP class. The tricky parts are usually political: will users be angry if you confuse Indonesian with Malaysian, and so on?
[1] https://github.com/saffsd/langid.py
|
In fact, we had a course for high school students where they learnt how a language guesser works and had to modify one. A simple method that already works very well is:
* Create an n-gram fingerprint for each language by making a list of character uni-, bi-, and trigrams ordered by their frequency in a text. Retain the (say) 300 most frequent n-grams.
* To categorize a text, create a fingerprint for that text. Then compute, for each language, the sum of the n-gram rank differences between the text's fingerprint and the language's fingerprint. If an n-gram does not occur in the language's fingerprint, the difference is the fingerprint size. Finally, pick the language with the lowest sum.
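The two steps above can be sketched in a few lines of Python. This is only a rough illustration, not the code from the course: the function names (`ngrams`, `fingerprint`, `distance`, `guess_language`), the lowercasing, the alphabetical tie-breaking, and the tiny training sentences are all my own assumptions.

```python
from collections import Counter

def ngrams(text, max_n=3):
    """Yield all character n-grams of length 1..max_n (lowercased)."""
    text = text.lower()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]

def fingerprint(text, size=300):
    """Map each of the `size` most frequent n-grams to its rank (0 = most frequent)."""
    counts = Counter(ngrams(text))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:size]
    return {gram: rank for rank, gram in enumerate(ranked)}

def distance(doc_fp, lang_fp, size=300):
    """Sum of rank differences; an n-gram missing from lang_fp costs `size`."""
    return sum(abs(rank - lang_fp[gram]) if gram in lang_fp else size
               for gram, rank in doc_fp.items())

def guess_language(text, lang_fps, size=300):
    """Return the language whose fingerprint is closest to the text's."""
    doc_fp = fingerprint(text, size)
    return min(lang_fps, key=lambda lang: distance(doc_fp, lang_fps[lang], size))

# Toy training texts, made up for this example; real fingerprints need far more text.
samples = {
    "en": "the quick brown fox jumps over the lazy dog and then runs through the forest",
    "nl": "de snelle bruine vos springt over de luie hond en rent daarna door het bos",
}
lang_fps = {lang: fingerprint(text) for lang, text in samples.items()}
print(guess_language("the dog runs over the fox", lang_fps))
```

With training texts this small the guesses are unreliable; the point is only to show the shape of the computation.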
Of course, you can do fancier things, such as training an SVM or logistic regression classifier with n-grams and words as features, etc.
An interesting variation is distinguishing different languages within a single text, e.g. a Dutch text with English quotes.
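One crude way to approach the mixed-language case is to classify each sentence separately and merge adjacent sentences that get the same label. The sketch below assumes this sentence-level approach, which is my own simplification: `segment_by_language` is a hypothetical helper, `classify` can be any function that maps a string to a language code, and `toy_classify` is a stand-in used only to make the example run.

```python
import re
from itertools import groupby

def segment_by_language(text, classify):
    """Split text into sentences, label each one, and merge adjacent runs."""
    # Naive sentence splitter: break on whitespace after ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    labeled = [(classify(s), s) for s in sentences]
    return [(lang, " ".join(s for _, s in group))
            for lang, group in groupby(labeled, key=lambda pair: pair[0])]

def toy_classify(sentence):
    """Stand-in classifier (an assumption): 'en' if the word 'the' occurs, else 'nl'."""
    return "en" if " the " in f" {sentence.lower()} " else "nl"

text = "Dit is een zin. He said the book is good. The end. Dat is alles."
for lang, chunk in segment_by_language(text, toy_classify):
    print(lang, chunk)
```

This will miss a short English quote inside a Dutch sentence; handling that would need smaller windows, and short windows give an n-gram guesser much less to work with.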