by jules
4274 days ago
Seems like this just computes the L1 distance between the input's trigram count vector and that of some preselected reference document for each language. That won't be very accurate. A much better way to go here is naive Bayes. There are more sophisticated approaches, but naive Bayes will already get you much further than this. If you train it on Wikipedia articles for the most popular languages, you would most likely get >99% accuracy.
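For what it's worth, a minimal sketch of naive Bayes over character trigrams might look something like this. The function names (`trigrams`, `train`, `classify`), the add-one smoothing, the uniform prior over languages, and the tiny two-sentence toy corpus are all my assumptions for illustration; real training data would be far larger, e.g. Wikipedia articles per language.

```python
import math
from collections import Counter


def trigrams(text):
    """Return the overlapping character trigrams of text, padded with spaces."""
    padded = f"  {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(corpus):
    """corpus maps a language code to training text (hypothetical format)."""
    counts = {lang: Counter(trigrams(text)) for lang, text in corpus.items()}
    vocab = set().union(*counts.values())
    return counts, len(vocab)


def classify(text, model):
    """Pick the language maximizing the summed log-likelihood of the trigrams."""
    counts, vocab_size = model
    grams = trigrams(text)
    best_lang, best_score = None, -math.inf
    for lang, lang_counts in counts.items():
        total = sum(lang_counts.values())
        # Add-one smoothing keeps unseen trigrams from zeroing the probability.
        # A uniform prior over languages is assumed, so it is left out.
        score = sum(
            math.log((lang_counts[g] + 1) / (total + vocab_size)) for g in grams
        )
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


# Toy training data, for illustration only.
corpus = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann schläft der hund in der sonne",
}
model = train(corpus)
print(classify("the dog and the fox", model))
print(classify("der hund und der fuchs", model))
```

Unlike the distance-to-one-reference-document approach, this scores the input against per-language trigram probabilities, so longer inputs accumulate more evidence and rare trigrams are handled by the smoothing.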
I think I found it in this paper [1]. The implementation was something like 13 lines of Python. I wonder how it would compare.
[1] http://www.ccs.neu.edu/home/jaa/CSG399.05F/Topics/Papers/Ben...