by pietz 967 days ago
I'm always happy to see OSS contributions, but I don't quite understand why this model is so remarkable. As the leaderboard suggests, it ranks lower than OpenAI embeddings, while 14 other contributions rank even higher than that, many of which feature a dimensionality comparable to or lower than 768.

The 8k context window is new, but isn't the 512 token limitation a soft limit anyway? I'm pretty sure I can stuff bigger documents into BGE for example.

Furthermore, I think that most (all?) benchmarks in the MTEB leaderboard deal with very small documents. So there is nothing here that validates how well this model does on larger documents. If anything, I'd pick a higher ranking model because I put little trust in one that only ranks 17th on small documents. Should I expect it to magically get better when the documents get larger?

Plus, you can expect that this model was designed to perform well on the datasets in MTEB while the OpenAI model probably wasn't.

Many also stated that 8k-context embeddings will not be very useful in most situations.

When would anyone use this model?

4 comments

Potentially useful for paragraph embedding, where... well, paragraphs can grow a lot. Not sure how this model fares in comparison to other embedding engines (yet), but I can definitely tell you mpnet models fare much better for paragraph embeddings than the leader in HF's leaderboard (being thenlper/gte-large at time of writing).

I'd guess that Davinci and similar embeddings work better for code than MPNet. It really matters what you are encoding, not only the context length, and what features are actually being extracted by the embedding engine.

I have been trying to understand the hype as well. Happy to see all the work happening in this space still.

I was pretty curious about the context limit. I am not an expert in this area, but I always thought the biggest problem was the length of your original text. So typically you might only encode a sentence or a selection of sentences. You could always stuff more in, but then you are potentially losing specificity, which I would think is a function of the dimensionality. This model is 768-dimensional. Are they saying I can stuff in 8k tokens' worth of text and utilize it just as well as I have with other models at the level of 1-3 sentences?

Thinking about it some more as I read through more comments: I guess in the stated case of research papers it can make sense if your task is looking for common themes and not specific details. If you are embedding a sentence or a paragraph, you miss out on the connection between those sentences across the whole paper... or at least it's harder to manage that. By encoding a large number of pages from the paper (or the entire paper) you can hopefully do a better job of capturing the theme of that paper.

This also opens up another question, though: how would that compare to using an LLM to summarize the paper and then just embedding that summary?

I would guess that the embedded summary is better, but for many tasks where you use embeddings (like document search), summarizing every document with an LLM is too expensive and slow.

I fail to imagine an 8k-token-length piece of text that has just one single semantic coordinate and is appropriate for embedding and vector search.

In my experience, any text is better embedded using a sliding window of a few dozen words. This is the approximate size of a semantic unit in a written document in English, although it will differ wildly across texts and topics.
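For anyone who wants to try the sliding-window idea, here is a minimal sketch in Python. The function name `sliding_windows` and the defaults (40-word windows, advancing 20 words at a time) are my own assumptions for illustration, not something from the comment above; tune them for your texts. Each returned window would then be embedded separately.

```python
def sliding_windows(text, size=40, stride=20):
    """Split text into overlapping windows of `size` words, advancing `stride` words each step.

    `size` and `stride` are illustrative defaults (assumptions), not recommended values.
    """
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    words = text.split()  # naive whitespace tokenization
    if not words:
        return []
    windows = []
    start = 0
    while True:
        windows.append(" ".join(words[start:start + size]))
        # Stop once the current window reaches the end of the text.
        if start + size >= len(words):
            break
        start += stride
    return windows
```

With a stride smaller than the window size, neighbouring windows overlap, so a sentence that straddles a boundary still appears whole in at least one window as long as it is shorter than the overlap.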

What are you using those embeddings for?

I can see a sliding window working for semantic search and RAG, but not so much for clustering or finding related documents.

Ah yes, clustering is indeed something that would benefit from large context, I agree.

Even so, I would think about the documents themselves and figure out whether it is even needed. Let's say we are talking about clustering court proceedings. I'd rather extract the abstracts from these documents, then embed and cluster those instead of the whole text.

> The 8k context window is new

Hasn’t Claude had this for many months (before they bumped to 100k)?

Edit: ah, you mean new for OSS maybe?

Claude is a large language model, which is a different thing from an embedding model.

Any large language model generates embedding representations at every layer of the model, and these can be trivially extracted. So, large language models are indeed embedding models.

This leaderboard doesn't compare these custom tailored embedding models vs the obvious thing of average pooling layered with any traditional LLM, which is easily implemented using sentence transformers.
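To make "average pooling" concrete, here is a rough sketch in plain Python. It assumes you have already pulled one vector per token out of some layer of the model, plus an attention mask marking which positions are real tokens (1) and which are padding (0). The function name `mean_pool` and the list-of-lists input format are hypothetical, chosen only to keep the example free of dependencies; a real setup would do this on tensors.

```python
def mean_pool(token_vectors, attention_mask):
    """Average per-token vectors into one vector, ignoring masked (padding) positions.

    token_vectors: list of equal-length lists of floats, one per token (assumed input format).
    attention_mask: list of 0/1 flags, same length as token_vectors.
    """
    if len(token_vectors) != len(attention_mask):
        raise ValueError("token_vectors and attention_mask must have the same length")
    # Keep only the vectors for real (unmasked) tokens.
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("no unmasked tokens to pool")
    dim = len(kept[0])
    # Average each dimension across the kept tokens.
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]
```

Note that the pooled vector has the same number of dimensions as each per-token vector, so its size is fixed by the model's hidden size rather than by the length of the input.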

Because 4K+ dimensional embeddings are functionally useless.

Aha, that’s what I missed, thanks!