| HN Mirror


by nitotm 23 hours ago

Thanks. I will try to answer.

ELD works like a traditional language detector, storing n-grams and tuned scores, so it does not use a modern neural network.

It cleans the input text, extracts the words, and splits them into n-grams (tokens). Each n-gram's hash is looked up in a fast hash table, which points to score slots for however many languages use that n-gram. From those, we build up a score for each of the languages found.

It sounds simple, and it is, because the real work is done when training the database, that is, when setting the score values.

The database looks something like {"ngram_1":{Lang_id_1:score, Lang_id_7: score, ...}}, {"ngram_2":{Lang_id_5:score}}
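To make the lookup-and-sum idea concrete, here is a minimal Python sketch. It is not ELD's actual code: the names `TOY_DB`, `extract_ngrams` and `detect`, the trigram size, the space padding, and every score value are assumptions made up for illustration. It also keys the table by the n-gram string rather than by a hash.

```python
import re
from collections import defaultdict

# Hypothetical toy database in the shape described above:
# n-gram -> {language: score}. All values are invented for illustration.
TOY_DB = {
    " th": {"en": 2.0},
    "the": {"en": 3.0, "nl": 0.5},
    "he ": {"en": 1.5, "nl": 0.4},
    " el": {"es": 2.5},
    "el ": {"es": 2.0, "nl": 0.6},
    " de": {"es": 1.0, "nl": 1.5},
    "de ": {"es": 1.2, "nl": 2.0},
}


def extract_ngrams(text, n=3):
    """Lowercase the text, keep runs of letters as words, pad each word
    with spaces, and slide a window of size n over it."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    ngrams = []
    for word in words:
        padded = f" {word} "
        ngrams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return ngrams


def detect(text, db=TOY_DB):
    """Sum the stored scores per language for every n-gram found in db.
    Returns (best_language, scores), or (None, {}) if nothing matched."""
    scores = defaultdict(float)
    for ngram in extract_ngrams(text):
        # Only languages that have a slot for this n-gram contribute.
        for lang, score in db.get(ngram, {}).items():
            scores[lang] += score
    if not scores:
        return None, {}
    best = max(scores, key=scores.get)
    return best, dict(scores)


if __name__ == "__main__":
    print(detect("the the"))
    print(detect("el de"))
```

With this toy table, a language that has no entry for an n-gram simply gets nothing added for it, which is why all the interesting behavior lives in the trained score values.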

I hope this answers your question. I could go into more detail.

Also, if anybody finds this interesting, you could "Vouch" for this post so that it becomes public; it is hidden because I am a new user.

1 comments

dignal 21 hours ago

Okay. Each n-gram in the database has a score for each language. What does each score mean?


nitotm 18 hours ago

Well, not for every language, but for every language that uses that n-gram. The scoring weights the importance of an n-gram for a particular language.


dignal 1 hour ago

How are the weights calculated?
