Hacker News
by allan_s 4271 days ago
For that, the approach I took in TatoDetect (which is meant specifically for detecting the language of one sentence at a time) is to keep, for each language, a database of N-grams large enough that you can be nearly sure it contains them all. Then, if the text to detect contains an N-gram that a language does not have in its database, you can apply a score decrease to that language.
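
To make the idea concrete, here is a rough sketch in Python. It is not TatoDetect's actual code: the function names (`ngrams`, `build_database`, `detect`), the use of character trigrams, the `reward` and `penalty` values, and the tiny sample corpus are all assumptions made for illustration. A real database would need far more sentences per language for "unseen N-gram" to be a meaningful signal.

```python
from collections import defaultdict


def ngrams(text, n=3):
    """Return all overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_database(samples, n=3):
    """Map each language to the set of n-grams seen in its sample sentences."""
    db = defaultdict(set)
    for lang, sentence in samples:
        db[lang].update(ngrams(sentence.lower(), n))
    return dict(db)


def detect(sentence, db, n=3, reward=1.0, penalty=2.0):
    """Score every language; an n-gram missing from a language's
    database decreases that language's score (hypothetical weights)."""
    grams = ngrams(sentence.lower(), n)
    scores = {}
    for lang, known in db.items():
        score = 0.0
        for g in grams:
            score += reward if g in known else -penalty
        scores[lang] = score
    best = max(scores, key=scores.get)
    return best, scores


# Toy corpus, far too small for real use (illustration only).
samples = [
    ("eng", "the cat is sleeping on the sofa"),
    ("eng", "where is the train station"),
    ("fra", "le chat dort sur le canapé"),
    ("fra", "où est la gare"),
]
db = build_database(samples)
best, scores = detect("the cat is on the train", db)
print(best, scores)
```

The penalty only works as intended when each language's database is close to complete. With a sparse database, a correct language gets punished for N-grams it simply has not seen yet.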