by cuuupid 546 days ago
It has been for a while; we ended up building our own test set to evaluate embedding models on our domain.

What we realized after doing this is that MTEB has always been a poor indicator, as embedding model performance varies wildly between in-domain and out-of-domain data. You'll get decent performance (let's say 70%) with most models, but eking out gains beyond that is more domain-dependent than model-dependent.
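
For anyone wanting to try the same thing, here is a minimal sketch of an in-domain eval, assuming you have (query, relevant doc id) pairs from your own data. This is not our actual harness: `recall_at_k`, `toy_embed`, and the sample data are made up for illustration, and `toy_embed` is a bag-of-words stand-in for whatever real model you are comparing.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall_at_k(
    embed: Callable[[str], Vector],
    docs: Dict[str, str],
    queries: List[Tuple[str, str]],
    k: int = 5,
) -> float:
    """Fraction of queries whose labeled doc appears in the top-k results."""
    doc_vecs = {doc_id: embed(text) for doc_id, text in docs.items()}
    hits = 0
    for query, relevant_id in queries:
        q = embed(query)
        ranked = sorted(doc_vecs, key=lambda d: cosine(q, doc_vecs[d]), reverse=True)
        if relevant_id in ranked[:k]:
            hits += 1
    return hits / len(queries)


# Hypothetical stand-in for a real embedding model: word counts over a tiny vocabulary.
VOCAB = ["invoice", "refund", "login", "password"]


def toy_embed(text: str) -> List[float]:
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


# Made-up in-domain test set: documents plus (query, relevant doc id) pairs.
DOCS = {"d1": "refund invoice policy", "d2": "reset login password"}
QUERIES = [("how to refund invoice", "d1"), ("forgot password login", "d2")]

score = recall_at_k(toy_embed, DOCS, QUERIES, k=1)
print(f"recall@1 = {score:.2f}")
```

Swap `toy_embed` for each candidate model and compare the scores on the same test set; that comparison is the number that matters for your domain, not the leaderboard rank.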

Personally, I recommend NV-Embed because it's easy to deploy and easy to get the other performance measurements (e.g. speed) up to a high spec. You can then simply enrich the data itself, e.g. by using an LLM to create standardized artifacts that point back to the original text, kind of like an "embedding symlink."
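
A rough sketch of the "embedding symlink" idea, under some loud assumptions: `toy_standardize` is a hypothetical stand-in for the LLM call (here just a synonym map), `toy_embed` is a stand-in for the real embedding model, and all names and data are invented for illustration. The point is only the shape: embed the standardized artifact, but return the original text it points to.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


@dataclass
class Artifact:
    source_id: str  # pointer back to the original text (the "symlink")
    text: str       # standardized text that actually gets embedded


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(
    sources: Dict[str, str],
    standardize: Callable[[str], str],
    embed: Callable[[str], Vector],
) -> List[Tuple[Artifact, Vector]]:
    """Embed the standardized artifact for each source, not the raw text."""
    index = []
    for source_id, raw in sources.items():
        artifact = Artifact(source_id=source_id, text=standardize(raw))
        index.append((artifact, embed(artifact.text)))
    return index


def search(
    query: str,
    index: List[Tuple[Artifact, Vector]],
    sources: Dict[str, str],
    embed: Callable[[str], Vector],
    k: int = 1,
) -> List[str]:
    """Match against artifacts, then follow the pointer to the original text."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [sources[artifact.source_id] for artifact, _ in ranked[:k]]


# Hypothetical stand-in for the LLM: maps domain jargon to canonical terms.
SYNONYMS = {"creds": "password", "reimbursement": "refund"}


def toy_standardize(raw: str) -> str:
    return " ".join(SYNONYMS.get(word, word) for word in raw.lower().split())


# Hypothetical stand-in for the embedding model: counts over a tiny vocabulary.
VOCAB = ["password", "refund"]


def toy_embed(text: str) -> List[float]:
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


SOURCES = {
    "t1": "User lost their creds again",
    "t2": "Customer wants reimbursement for order",
}
INDEX = build_index(SOURCES, toy_standardize, toy_embed)
print(search("password reset", INDEX, SOURCES, toy_embed))
```

The query never mentions "creds", but it still resolves to the original ticket because matching happens against the standardized artifact.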

Our observation has broadly been that after standardizing the data, the top-n models mostly perform the same.

1 comment

Unfortunately, it requires commercial licensing. I spoke with them a while ago about pricing, and it was awfully expensive for being just one part of a larger product. We have been trying other common open-source models, and the results have been comparable when using them for retrieval on our domain-specific data.