Show HN: ELDC – Natural language identification, faster than FastText and CLD2 (github.com)
3 points by nitotm 3 days ago
I want to introduce ELDC, an efficient language detector, written in C, designed to maximize speed and accuracy within a relatively constrained memory footprint.

ELDC is the latest iteration of the ELD software I made years ago. This version is available as an executable, a library, and a Python package.

This is my first C software, or my first compiled software of any kind for that matter; I previously built this in pure PHP, JavaScript, and Python.

Highlights:
- Performance: In my benchmarks, it runs faster than CLD2 and much faster than FastText. I believe the results are reproducible for any workload.
- Accuracy: Within its supported language set, the benchmarks show it to be more accurate than Lingua, CLD3, CLD2, FastText, etc. Accuracy is very benchmark-dependent, so I will make no claim other than that ELDC is highly accurate.
- It supports 60 languages. Its architecture scales efficiently with database size, so I can add more n-grams or languages with relatively low impact.
- Memory usage: The compiled software is about 26MB, and it also builds a 32MB hashtable on load.

Notes:
- Database size: I do have other database sizes (featured in the PHP version), but I went for simplicity and used the optimal size. More sizes could be added.
- Single detection: I optimized for multi-detection. For a single detection, a B-tree would offer faster loading and lower memory usage than the current hashtable. I didn't anticipate single detection being the most common use case, but it could be optimized for.

I would like to get some feedback. I'm curious to see if my speed claims hold true against your own tests. :)


It's very interesting. Thank you for making it open-source. Would you like to explain the inner workings of ELD? For example, what is the model of language that it produces, how does it compare a new word to the model, and how is the word scored for different languages?
Thanks. I will try to answer.

ELD works like a traditional language detector, storing n-grams and tuned scores (so it does not use a modern neural network).

It cleans the input text, extracts words, and splits them into n-grams/tokens. Each n-gram's hash is looked up in a fast hashtable, which points to several score slots for a variable number of languages. We then accumulate the scores for each of the languages found.

Sounds simple, and it is, because the real work is done when training the database, which is when the score values are set.

The database looks something like {"ngram_1": {Lang_id_1: score, Lang_id_7: score, ...}, "ngram_2": {Lang_id_5: score}}
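
To make the lookup-and-score flow concrete, here is a minimal Python sketch. It is not ELDC's actual code: the toy database, its n-grams, the language ids, the score values, the trigram size, and the function names `extract_ngrams` and `detect` are all invented for illustration, and a plain dict stands in for the hashed n-gram table.

```python
import re
from collections import defaultdict

# Toy database in the shape described above: n-gram -> {language id: score}.
# The n-grams, language ids, and score values are invented for illustration.
TOY_DB = {
    " th": {"en": 2.0},
    "the": {"en": 3.0, "de": 0.25},
    "he ": {"en": 1.5},
    " de": {"es": 1.5, "de": 1.0, "fr": 1.5},
    "de ": {"es": 2.0, "fr": 2.0},
    " el": {"es": 2.5},
    "el ": {"es": 2.5},
}


def extract_ngrams(text, n=3):
    """Lowercase the text, split it into words, and yield padded character n-grams."""
    # [^\W\d_]+ matches runs of letters only (no digits, underscores, or punctuation).
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        padded = f" {word} "  # spaces mark the word boundaries
        for i in range(len(padded) - n + 1):
            yield padded[i:i + n]


def detect(text, db=TOY_DB):
    """Sum the stored scores per language and return (best language, all scores)."""
    scores = defaultdict(float)
    for ngram in extract_ngrams(text):
        # A dict stands in for the hashtable; unknown n-grams contribute nothing.
        for lang, score in db.get(ngram, {}).items():
            scores[lang] += score
    if not scores:
        return None, {}
    return max(scores, key=scores.get), dict(scores)
```

Calling `detect("the cat")` with this toy database returns "en" as the best language, because only the n-grams of "the" are present in the table. An n-gram only adds to the languages it has a score slot for, which is why the training step that sets those scores carries most of the weight.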

I hope this answers your question. I could go into more detail.

Also, if anybody finds this interesting, you could "Vouch" for this post so that it goes public; it is currently hidden because I am a new user.

Okay. Each n-gram in the database has a score for each language. What does each score mean?
Well, not for every language, but for every language that uses that n-gram. The scoring weights the importance of an n-gram for a particular language.
How are the weights calculated?