

by ggnore7452 636 days ago

If anything, I would consider embeddings a bit overrated, or at least it is safer to underrate them.

They're not the silver bullet many initially hoped for, and they're not a complete replacement for simpler methods like BM25. They have only very limited "semantic understanding" (and as people throw increasingly large chunks into embedding models, the meanings can get even fuzzier).

Overly high expectations lead people to believe that embeddings will retrieve exactly what they mean. With larger top-k values and LLMs that are exceptionally good at rationalizing responses, it can be difficult to notice mismatches unless you examine the results closely.

3 comments

deepsquirrelnet 636 days ago

Absolutely. Embeddings have been around a while, and most people don’t realize that it wasn’t until the E5 series of models from Microsoft that they even benchmarked as well as BM25 in retrieval scores, while being significantly more costly to compute.

I think sparse retrieval with cross-encoders doing reranking is still significantly better than embeddings. Embedding indexes are also difficult to scale, since HNSW consumes too much memory above a few million vectors and IVF-PQ has issues with recall.
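
For anyone who has not seen the shape of that pipeline, here is a minimal sketch of sparse retrieval followed by reranking. It assumes whitespace tokenization, the common defaults k1=1.5 and b=0.75, and an IDF variant that stays non-negative. The `rerank_fn` parameter is a hypothetical stand-in for a cross-encoder that scores a (query, document) pair; nothing here is a specific library's API.

```python
import math
from collections import Counter


def tokenize(text):
    # Toy tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.tfs = [Counter(d) for d in self.docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        n = len(self.docs)
        # Document frequency: number of documents containing each term.
        df = Counter(t for d in self.docs for t in set(d))
        # IDF variant with +1 inside the log so weights are never negative.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        tf, dl = self.tfs[i], len(self.docs[i])
        total = 0.0
        for t in tokenize(query):
            if t not in tf:
                continue
            # Term-frequency saturation with document-length normalization.
            norm = tf[t] + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[t] * tf[t] * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k):
        scored = [(self.score(query, i), i) for i in range(len(self.docs))]
        ranked = sorted(scored, key=lambda pair: -pair[0])
        # Drop documents that share no terms with the query.
        return [i for s, i in ranked[:k] if s > 0]


def retrieve_then_rerank(bm25, raw_docs, query, rerank_fn, k=10, final_k=3):
    # Stage 1: cheap sparse retrieval narrows the corpus to k candidates.
    candidates = bm25.top_k(query, k)
    # Stage 2: the (hypothetical) cross-encoder rescores only those candidates.
    reranked = sorted(candidates, key=lambda i: rerank_fn(query, raw_docs[i]), reverse=True)
    return reranked[:final_k]
```

The point of the two stages is that the expensive model only ever sees the k candidates, not the whole corpus.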


nostrebored 636 days ago

Off-the-shelf embedding models definitely underpromise and overdeliver. In ten years I'd be very surprised if companies weren't fine-tuning embedding models for search based on their data in any competitive domains.


kkielhofner 636 days ago

My startup (Atomic Canyon) developed embedding models for the nuclear energy space[0].

Let's just say that if you think off-the-shelf embedding models are going to work well with this kind of highly specialized content you're going to have a rough time.

[0] - https://huggingface.co/atomic-canyon/fermi-1024


kkielhofner 636 days ago

> they're not a complete replacement for simpler methods like BM25

There are embedding approaches that balance "semantic understanding" with BM25-ish.

They're still pretty obscure outside of the information retrieval space, but sparse embeddings[0] are the "most" widely used.

[0] - https://zilliz.com/learn/sparse-and-dense-embeddings
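
To make the idea concrete, a rough sketch of how sparse embeddings are typically scored: each text becomes a mostly-zero vector of term weights, stored here as a dict, and relevance is the dot product over shared terms. The weights below are made up for illustration. In a real system they would come from a learned sparse model, which can also put weight on related terms that never literally appear in the text.

```python
def sparse_dot(query_vec, doc_vec):
    # Iterate over the smaller dict; only terms present in both contribute.
    if len(query_vec) > len(doc_vec):
        query_vec, doc_vec = doc_vec, query_vec
    return sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)


def rank(query_vec, doc_vecs):
    # doc_vecs maps a document id to its sparse vector (term -> weight).
    scores = {doc_id: sparse_dot(query_vec, vec) for doc_id, vec in doc_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical weights, not output from any real model.
query = {"reactor": 1.2, "nuclear": 0.8}
docs = {
    "a": {"reactor": 0.9, "coolant": 1.1, "nuclear": 0.5},
    "b": {"embedding": 1.0, "search": 0.7},
}
print(rank(query, docs))
```

Because the vectors are keyed by vocabulary terms, they can be served from an inverted index much like BM25, which is where the "BM25-ish" behavior comes from.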
