HN Mirror


by estreeper 960 days ago

For embeddings specifically, multiple open source models outperform OpenAI’s best model (text-embedding-ada-002), as you can see on the MTEB Leaderboard [1].

> embedding-based approach will be cheaper and faster, but worse result than full text

I’m not sure the results would be worse. I think it depends on how well the models can ignore irrelevant context, which is a known problem [2]. Using retrieval can come closer to providing only relevant context.

1. https://huggingface.co/spaces/mteb/leaderboard

2. https://arxiv.org/abs/2302.00093
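To make the retrieval point concrete, here is a minimal sketch of embedding-based selection: rank chunks by cosine similarity to the query and pass only the top few to the model. The chunk texts and 3-dimensional vectors are made up for illustration; in practice the vectors would come from an embedding model, and `cosine_similarity` and `top_k` are hypothetical helper names, not part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunks, k=2):
    """Return the texts of the k chunks most similar to the query."""
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, chunk[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# Toy data (assumption): hand-written vectors stand in for real embeddings.
chunks = [
    ("chunk about contract law", [0.9, 0.1, 0.0]),
    ("chunk about cooking", [0.0, 0.2, 0.9]),
    ("chunk about tort law", [0.8, 0.3, 0.1]),
]
query = [1.0, 0.2, 0.0]

# Only the selected chunks would be placed in the prompt.
selected = top_k(query, chunks, k=2)
print(selected)
```

The irrelevant chunk never reaches the prompt, which is the sense in which retrieval comes closer to providing only relevant context.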


karmasimida 959 days ago

> on the MTEB Leaderboard

The point isn't about the leaderboard. With increasing context length, the question is whether we need embeddings at all. With longer context, embeddings are no longer a necessity, which lowers their value.


civilitty 959 days ago

For more trivial use cases, sure, but not for harder stuff like working with US law and precedent.

The US Code is on the order of tens of millions of tokens, and I shudder to think how many billions of tokens make up all the judicial opinions that set or interpreted precedent.
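A rough back-of-the-envelope sketch of the scale argument, with loudly assumed numbers: 20 million tokens stands in for "tens of millions", and the 100,000-token context window is a hypothetical figure, not something stated in this thread. `windows_needed` is an illustrative helper name.

```python
def windows_needed(corpus_tokens, context_window):
    """Number of full-context passes required to read the whole corpus once."""
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    # Integer ceiling division, avoids floating-point rounding.
    return -(-corpus_tokens // context_window)

US_CODE_TOKENS = 20_000_000  # assumption: stand-in for "tens of millions"
CONTEXT_WINDOW = 100_000     # assumption: hypothetical context length

passes = windows_needed(US_CODE_TOKENS, CONTEXT_WINDOW)
print(passes)
```

Under these assumptions the statute text alone is hundreds of context windows, before counting any judicial opinions, so some selection step is still needed for a corpus of this size.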

link