by wodenokoto
1588 days ago
> Most libraries only use n-grams of size 3 (trigrams) which is satisfactory for detecting the language of longer text fragments consisting of multiple sentences. For short phrases or single words, however, trigrams are not enough.

I only dabbled in language detection at a workshop at a conference years ago, but I was very impressed by how well such models work on short text with only bigrams. Maybe bigrams and trigrams fall short once you expand to over 70 languages, but I just wanted to say that this is a use case where very simple models can get you really far.

If you see a blog post where a language detection problem is solved with deep learning, chances are the author doesn't know what they are doing (Towards Data Science, I'm looking at you!) or it's a tutorial for working with an NN framework.
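For anyone who hasn't seen how small such a model is, here is a minimal sketch of a character-bigram detector with add-one smoothing. It is not the workshop code and not any particular library's implementation; the function names, the `SAMPLES` texts, and the smoothing choice are all assumptions made for illustration, and a real model would be trained on far more text per language.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Return overlapping character n-grams, padded with spaces at both ends."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=2):
    """Build one n-gram count table per language (samples: lang -> text)."""
    return {lang: Counter(char_ngrams(text, n)) for lang, text in samples.items()}


def score(text, counts, vocab_size, n=2):
    """Sum of add-one-smoothed log probabilities of the text's n-grams."""
    total = sum(counts.values())
    return sum(
        math.log((counts[gram] + 1) / (total + vocab_size))
        for gram in char_ngrams(text, n)
    )


def detect(text, models, n=2):
    """Return the language whose model gives the text the highest score."""
    vocab = set().union(*models.values())
    vocab_size = len(vocab) + 1  # one extra slot for unseen n-grams
    return max(models, key=lambda lang: score(text, models[lang], vocab_size, n))


# Toy training data (hypothetical, far too small for real use).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is the house "
          "where the children play",
    "de": "der schnelle braune fuchs springt über den faulen hund und das ist "
          "das haus wo die kinder spielen",
}

models = train(SAMPLES)
print(detect("the house", models))
print(detect("das haus", models))
```

Note that `detect` always returns one of the trained languages, whatever the input is.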
|
By "real", I mean texts in a mix of multiple languages (super common on the web); short texts; texts in a different (unknown) language, where n-gram models have no way to say "I don't know" and return rubbish instead; texts in closely related languages; etc.
Going "deep learning" is not the only alternative. Even simpler methods can work significantly better, while being fully interpretable:
https://link.springer.com/chapter/10.1007/978-3-642-00382-0_...
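To make the "I don't know" point concrete, here is a rough sketch of one possible rejection rule. It is not the method from the linked chapter, only a hypothetical heuristic: refuse to answer when too few of the input's character bigrams were ever seen in the best-matching language. The names `coverage` and `detect_or_none`, the `min_coverage` threshold of 0.5, and the sample texts are all assumptions.

```python
def char_bigrams(text):
    """Return overlapping character bigrams, padded with spaces at both ends."""
    padded = f" {text.lower()} "
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def coverage(text, known_bigrams):
    """Fraction of the text's bigrams that appear in a language profile."""
    grams = char_bigrams(text)
    return sum(gram in known_bigrams for gram in grams) / len(grams)


def detect_or_none(text, profiles, min_coverage=0.5):
    """Return the best-covered language, or None if even that one fits poorly."""
    best = max(profiles, key=lambda lang: coverage(text, profiles[lang]))
    if coverage(text, profiles[best]) < min_coverage:
        return None  # the explicit "I don't know" answer
    return best


# Toy training data (hypothetical, far too small for real use).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is the house "
          "where the children play",
    "de": "der schnelle braune fuchs springt über den faulen hund und das ist "
          "das haus wo die kinder spielen",
}

# A profile here is just the set of bigrams seen in the training text.
PROFILES = {lang: set(char_bigrams(text)) for lang, text in SAMPLES.items()}

print(detect_or_none("the house", PROFILES))
print(detect_or_none("привет мир", PROFILES))
```

This only covers the unknown-language case. Mixed-language texts and closely related languages need more than a single threshold.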