by nitotm 23 hours ago
Thanks. I will try to answer.

ELD works like a traditional language detector: it stores n-grams with tuned scores, so it does not use a modern neural network.

It cleans the input text, splits it into words, and extracts n-grams (tokens). Each n-gram's hash is looked up in a fast hash table, where it points to score slots for however many languages use that n-gram. From those slots we build up a score for each of the languages found.
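
To make the first half of that concrete, here is a minimal Python sketch of the cleaning and n-gram extraction step. It is not ELD's actual code: the function name `extract_ngrams`, the fixed n-gram size of 3, the letters-only cleaning rule, and the space padding at word boundaries are all assumptions made for illustration.

```python
import re

def extract_ngrams(text, n=3):
    """Lowercase the text, split it into words, and return character n-grams.

    Assumptions (not taken from ELD): words are runs of letters, and each
    word is padded with one space on each side to mark its boundaries.
    """
    # Keep only runs of Unicode letters; digits and punctuation are dropped.
    words = re.findall(r"[^\W\d_]+", text.lower())
    ngrams = []
    for word in words:
        padded = f" {word} "
        # Slide a window of size n over the padded word.
        for i in range(len(padded) - n + 1):
            ngrams.append(padded[i:i + n])
    return ngrams

print(extract_ngrams("Hello, world!"))
```

Each of these n-grams would then be hashed and looked up in the table.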

Sounds simple, and it is, because the work is done when training the database, setting the score values.

The database looks something like {"ngram_1": {Lang_id_1: score, Lang_id_7: score, ...}}, {"ngram_2": {Lang_id_5: score}}
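
As a rough sketch of how a lookup against a table shaped like that could work, here is a toy Python version. Everything in it is hypothetical: the n-grams, language ids, and score values are invented, a plain dict stands in for the hashed table, and summing the scores per language is an assumption about how the totals are combined, not a description of ELD's tuned scoring.

```python
from collections import defaultdict

# Toy database: n-gram -> {language id: score}. All values are made up.
DATABASE = {
    " th": {"en": 0.9, "de": 0.1},
    "the": {"en": 0.8},
    "he ": {"en": 0.5, "de": 0.2},
    " de": {"es": 0.6, "de": 0.5},
    "der": {"de": 0.9},
    "er ": {"de": 0.6, "en": 0.3},
}

def score_languages(ngrams, database=DATABASE):
    """Accumulate a total score per language over all n-grams found."""
    totals = defaultdict(float)
    for ngram in ngrams:
        # Unknown n-grams contribute nothing.
        for lang, score in database.get(ngram, {}).items():
            totals[lang] += score  # assumption: scores are simply summed
    return dict(totals)

def detect(ngrams, database=DATABASE):
    """Return the language with the highest total, or None if nothing matched."""
    totals = score_languages(ngrams, database)
    return max(totals, key=totals.get) if totals else None

print(detect([" th", "the", "he "]))
```

Only the languages that actually appear under a matched n-gram ever get a total, which is why the per-n-gram entries can stay small.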

I hope this answers your question. I could go into more detail.

Also, if anybody finds this interesting, you could "vouch" for this post so that it becomes public; it is hidden because I am a new user.

1 comment

Okay. Each n-gram in the database has a score for each language. What does each score mean?
Well, not for every language, but for every language that uses that n-gram. The score weights how important the n-gram is for that particular language.
How are the weights calculated?