Hacker News new | ask | show | jobs
by wodenokoto 1588 days ago
> Most libraries only use n-grams of size 3 (trigrams) which is satisfactory for detecting the language of longer text fragments consisting of multiple sentences. For short phrases or single words, however, trigrams are not enough.

I only dabbled in language detection at a workshop at a conference years ago, but I was very impressed how well such models work on short text with only bigrams.

Maybe once you expand to over 70 languages does bi- and tri-grams fall short, but I just wanted to say that this is a usecase here very simple models can get you really far.

If you see a blog post where a language detection problem is solved with deep learning chances are the author doesn’t know what they are doing (towards datascience, I’m looking at you!) or it’s a tutorial for working with an NN framework.

2 comments

My experience is the opposite: character ngram models work "OK" on academic tasks and clean corpora. Not so much when unleashed on real data.

By "real", I mean texts in a mix of multiple languages (super common on the web); short texts; texts in a different (unknown) language where ngrams don't know how to say "I don't know" and return rubbish instead; texts in close languages; etc.

Going "deep learning" is not the only alternative. Even simpler methods can work significantly better, while being fully interpretable:

https://link.springer.com/chapter/10.1007/978-3-642-00382-0_...

We're using libraries like this to try to guess the language of a book based on title alone (in case no other information is readily available), and trigram-based algorithms get it wrong often enough for it to be noticeable. I will look into replacing our current library with this one, it seems better suited for the task at hand.
Yeah, language detection on short texts is quite complex. In my practice, N-grams doesn’t work well for them.