by edshiro 807 days ago
I don't have much experience with embeddings...

Could someone more knowledgeable suggest when it would make sense to use the SentenceTransformers library vs for instance relying on the OpenAI API to get embeddings for a sentence?

4 comments

It's fairly easy to use and not that compute-intensive (e.g. it can run even on a small-ish CPU VM), the embeddings tend to perform well, and you can avoid sending your data to a third party. Also, there are models fine-tuned for particular domains on the Hugging Face Hub that can potentially give better embeddings for content in that domain.
Just to add to this, a great resource is the Massive Text Embedding Benchmark (MTEB) leaderboard, which you can use to find good models to evaluate. Many open models that you can use with SentenceTransformers outperform, e.g., OpenAI's text-embedding-ada-002, which is currently ranked #46 for retrieval.

https://huggingface.co/spaces/mteb/leaderboard
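
For anyone who hasn't tried it, here is a rough sketch of what running embeddings locally can look like. The model name `all-MiniLM-L6-v2` is only an assumed example (swap in whatever you pick from the leaderboard), and the helper names `embed_locally`, `cosine_similarity` and `rank_by_similarity` are made up for illustration. It assumes `pip install sentence-transformers`; the similarity helpers are plain Python so they work on vectors from any source.

```python
import math
from typing import List, Sequence


def embed_locally(texts: Sequence[str], model_name: str = "all-MiniLM-L6-v2") -> List[List[float]]:
    """Embed texts on the local machine, so no data is sent to a third party.

    The default model name is an assumption for this sketch, not a recommendation.
    """
    # Imported lazily so the helpers below work without the package installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)  # fetches the model on first use
    return model.encode(list(texts)).tolist()


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]]) -> List[int]:
    """Return document indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, doc) for doc in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
```

To try it, embed a query and a few documents with `embed_locally`, then pass the query vector and the document vectors to `rank_by_similarity` to get the documents in order of relevance. Comparing that ranking across two or three candidate models on your own data is a quick sanity check before committing to one.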

I see - thanks for the clarifications

I presume if your customers are enterprise companies then you may opt to use this library vs sending their data to OpenAI etc.

And you can get more customisation/fine-tuning from this library too.

Embeddings are one of those things for which using OpenAI (or any other provider) isn't really necessary. There are many small open source embedding models that perform very well. Plus, you can fine-tune them on your task. You can also run them locally and not worry about all the constraints (latency, rate limits, etc.) of using an external provider's endpoint. If throughput is important for you, then you'll need a GPU.

The main reason to use one of those providers is if you want something that performs well out of the box without doing any work and you don't mind paying for it. Companies like OpenAI, Cohere and others have already done the work to make those models perform well on various domains. They may also use larger models that are not as easy to deal with yourself (although, as I mentioned previously, a small embedding model fine-tuned on your task is likely to perform as well as a much bigger general model).

You should basically never use the OpenAI embeddings.

There isn't a single use case where they're better than the free models, and they're slower, needlessly large, and outrageously expensive for what they are.

Up until a month ago, the OpenAI embeddings were very poor. But they recently released a new model which is much better than their previous one.

Now it depends on the specific use case (domain, language, length of texts).