by wooorm 4271 days ago
Agreed :)

For the single-sentence detection case, you can use the Tatoeba project database dump: http://tatoeba.org/eng/downloads

You get a CSV of ISO code => sentence, which should be 99% accurate (since it gets proofread by users), so you can compare your tool's output against it.
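A rough sketch of what that comparison could look like, in Python. It assumes the dump is tab-separated with rows of the form `id <TAB> iso_code <TAB> sentence`, which may not match the file you download, and `toy_detect` is a made-up stand-in for whatever library you want to test, not a real detector.

```python
import csv
from collections import Counter
from typing import Callable, Dict, Iterable, Iterator, Tuple


def read_labeled_sentences(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (iso_code, sentence) pairs from tab-separated rows.

    Assumption: each row is `id <TAB> iso_code <TAB> sentence`.
    Adjust the column indexes if your dump is laid out differently.
    """
    # QUOTE_NONE keeps quote characters inside sentences from confusing the parser.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) >= 3:  # skip blank or malformed rows
            yield row[1], row[2]


def evaluate(
    detect: Callable[[str], str],
    samples: Iterable[Tuple[str, str]],
) -> Dict[str, float]:
    """Return per-language accuracy of `detect` on labeled samples."""
    totals: Counter = Counter()
    hits: Counter = Counter()
    for expected, sentence in samples:
        totals[expected] += 1
        if detect(sentence) == expected:
            hits[expected] += 1
    return {code: hits[code] / totals[code] for code in totals}


def toy_detect(sentence: str) -> str:
    """Hypothetical detector used only to exercise the harness."""
    return "fra" if "bonjour" in sentence.lower() else "eng"


if __name__ == "__main__":
    rows = [
        "1\teng\tHello there.",
        "2\tfra\tBonjour tout le monde.",
        "3\tfra\tMerci beaucoup.",
    ]
    print(evaluate(toy_detect, read_labeled_sentences(rows)))
```

To run it on the real dump, pass an open file handle instead of the `rows` list and swap `toy_detect` for a thin wrapper around the library under test. Make sure the wrapper returns the same kind of ISO code the dump uses, otherwise every prediction will count as a miss.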

I think for longer text one could use a Wikipedia dump or something similar?

Thanks for the pointer. I might decide to whip something up one of these days. I really have no need for language detection, but I just find it interesting and I'm curious to see which libraries will win out.