by michaelmior 4274 days ago
It would be interesting to see comparisons with language detection libraries written in other languages as well. Not just in terms of runtime, but also accuracy. Actually, it seems like this would be useful as a separate project to help the decision-making process when choosing a library.

Agreed :)
For the "one sentence detection" case, you can use the Tatoeba project's database dump: http://tatoeba.org/eng/downloads

You get a CSV of ISO code => sentence pairs, which should be 99% accurate (since it gets proofread by users), so you can compare your tool's output against it.
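
A rough sketch of what that comparison could look like in Python. It assumes the dump is tab-separated with columns id, language code, sentence (check the actual file before relying on that), and `toy_detect` is a made-up stand-in for whatever library you want to score. The sample rows are invented for illustration, not taken from Tatoeba.

```python
import csv
import io
from collections import Counter


def read_pairs(handle):
    """Yield (lang, sentence) from tab-separated rows of id, lang, text.

    The three-column layout is an assumption about the dump format.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) >= 3:
            yield row[1], row[2]


def accuracy_by_language(pairs, detect):
    """Return {lang: fraction of sentences where detect() matched the label}."""
    total, correct = Counter(), Counter()
    for lang, sentence in pairs:
        total[lang] += 1
        if detect(sentence) == lang:
            correct[lang] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


def toy_detect(sentence):
    """Hypothetical detector: swap in the library under test here."""
    padded = f" {sentence.lower()} "
    return "fra" if " le " in padded else "eng"


# Invented sample rows in the assumed format, standing in for the real dump.
SAMPLE = "1\teng\tThe cat sleeps.\n2\tfra\tLe chat dort.\n3\tfra\tBonjour.\n"

scores = accuracy_by_language(read_pairs(io.StringIO(SAMPLE)), toy_detect)
for lang, score in sorted(scores.items()):
    print(lang, score)
```

For the real file you would pass `open(path, encoding="utf-8", newline="")` instead of the `StringIO` sample. Reporting accuracy per language rather than one overall number keeps the languages with the most sentences from hiding poor results on the smaller ones.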

I think for longer text one could use a Wikipedia dump or something similar?

Thanks for the pointer. I might decide to whip something up one of these days. I really have no need for language detection, but I just find it interesting and I'm curious to see which libraries will win out.