by VHRanger 803 days ago
Transformer models handle multilingual text directly.

For good old embedding models (e.g. GloVe), you have a few choices:

1. LASER as you mentioned. The performance tends to suck though.

2. Language prediction + one embedding model per supported language. Libraries like whichlang make this easy, and MUSE has embedding models aligned per language for 100ish languages.

Fastembed is a good library for this.
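
To make option 2 concrete, here is a minimal sketch of the routing: detect the language, pick that language's word-vector table, and average the word vectors. Everything in it is made up for illustration: `toy_detect` is a hypothetical stand-in for a real detector like whichlang, and `TABLES` is a two-word toy standing in for aligned vectors like the ones MUSE ships. It is not the API of any of those libraries.

```python
from typing import Callable, Dict, List

Vector = List[float]
Table = Dict[str, Vector]


def mean_pool(tokens: List[str], table: Table, dim: int) -> Vector:
    """Average the vectors of the known tokens; all zeros if none are known."""
    hits = [table[t] for t in tokens if t in table]
    if not hits:
        return [0.0] * dim
    return [sum(col) / len(hits) for col in zip(*hits)]


class PerLanguageEmbedder:
    """Routes each text to the word-vector table of its detected language."""

    def __init__(self, detect: Callable[[str], str], tables: Dict[str, Table],
                 dim: int, fallback: str = "en") -> None:
        self.detect = detect
        self.tables = tables
        self.dim = dim
        self.fallback = fallback

    def embed(self, text: str) -> Vector:
        lang = self.detect(text)
        # Unsupported language: fall back to the default table.
        table = self.tables.get(lang, self.tables[self.fallback])
        return mean_pool(text.lower().split(), table, self.dim)


# Hypothetical toy detector, standing in for a real language-ID library.
FRENCH_HINTS = {"le", "la", "chat", "chien"}


def toy_detect(text: str) -> str:
    return "fr" if FRENCH_HINTS & set(text.lower().split()) else "en"


# Toy "aligned" tables: translations share a vector across languages.
TABLES: Dict[str, Table] = {
    "en": {"cat": [1.0, 0.0], "dog": [0.0, 1.0]},
    "fr": {"chat": [1.0, 0.0], "chien": [0.0, 1.0]},
}

embedder = PerLanguageEmbedder(toy_detect, TABLES, dim=2)
```

Because the tables are aligned, `embedder.embed("cat")` and `embedder.embed("le chat")` land on the same vector, which is the whole point of using aligned per-language models.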

Note that for most people, 32-dimension GloVe is all they need, if they actually benchmark it.

As the length of the text you're embedding goes up, or as the specificity goes up (e.g. you have only medical documents and want to tell them apart), you'll need richer embeddings (more dimensions, or a transformer model, or both).

People never benchmark their embeddings, and I find it incredible how they end up with needlessly overengineered systems.
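
Benchmarking doesn't have to be a big project. A rough sketch, assuming you have a handful of queries with a known correct document each: compute recall@k for every candidate embedder on the same data and only pay for the bigger model if the number actually moves. `letter_freq_embed` below is a made-up toy embedder so the example runs on its own; swap in whatever models you're comparing.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def recall_at_k(embed: Callable[[str], Vector], queries: List[str],
                docs: List[str], relevant: List[int], k: int = 1) -> float:
    """Fraction of queries whose relevant doc index shows up in the top k."""
    doc_vecs = [embed(d) for d in docs]
    hits = 0
    for query, rel in zip(queries, relevant):
        qv = embed(query)
        ranked = sorted(range(len(docs)),
                        key=lambda i: cosine(qv, doc_vecs[i]),
                        reverse=True)
        if rel in ranked[:k]:
            hits += 1
    return hits / len(queries)


def letter_freq_embed(text: str) -> Vector:
    """Toy embedder (illustration only): 26-dim letter counts."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec


# Tiny made-up evaluation set: relevant[i] is the doc index for queries[i].
docs = ["aaa", "bbb", "ccc"]
queries = ["bbb", "ccc"]
relevant = [1, 2]

score = recall_at_k(letter_freq_embed, queries, docs, relevant, k=1)
```

Run the same `recall_at_k` call with the small model and the big model on your own queries and documents; if the scores are within noise, keep the small one.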

1 comments

Any idea which model has the best performance across languages? I'm checking out model performance on the Hugging Face leaderboard, and the top models for English aren't even in the top 20 for Polish, French and Chinese.

Depends on what your use case is.

For the normal user that just wants something across languages, the minilm-paraphrase-multilingual in the OP library is great.

If you want better than that (either a bigger model, or something specifically for a subset of languages, etc.) then you need to think about your task, priorities, target languages, etc.