Hacker News
by panarky 1023 days ago
The first unstated assumption is that similar vectors are relevant documents, and for many use cases that's just not true. Cosine similarity != relevance. So if your pipeline pulls 2 or 4 or 12 document chunks into the LLM's context, and half or more of them aren't relevant, does this make the LLM's response more or less relevant?

The second unstated assumption is that the vector index can accurately identify the top K vectors by cosine similarity, and that's not true either. If you retrieve the top K vectors according to the vector index (instead of computing all the pairwise similarities in advance), that set of K vectors will be missing documents that have a higher cosine similarity than that of the K'th vector retrieved.
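
To make the second point concrete, here's a minimal sketch that compares brute-force top K against a toy approximate index. The random-hyperplane bucketing below is only a stand-in I picked because it fits in a few lines; real vector indexes typically use other structures (HNSW, IVF and so on), and the sizes, seed and function names are all made up for illustration.

```python
import numpy as np

def normalize(x):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def exact_top_k(query, docs, k):
    """Brute force: score every document and return the indices of the k best."""
    scores = docs @ query
    return np.argsort(-scores)[:k]

def lsh_top_k(query, docs, planes, k):
    """Toy ANN: only score documents that fall in the same bucket as the query."""
    doc_codes = (docs @ planes.T) > 0      # which side of each hyperplane
    query_code = (planes @ query) > 0
    candidates = np.flatnonzero((doc_codes == query_code).all(axis=1))
    scores = docs[candidates] @ query
    return candidates[np.argsort(-scores)[:k]]

rng = np.random.default_rng(0)
docs = normalize(rng.standard_normal((5000, 64)))   # synthetic "embeddings"
query = normalize(rng.standard_normal((1, 64)))[0]
planes = rng.standard_normal((4, 64))               # 4 hyperplanes, 16 buckets

k = 10
exact = exact_top_k(query, docs, k)
approx = lsh_top_k(query, docs, planes, k)

# Recall of the approximate result, using brute force as ground truth.
recall = len(set(exact) & set(approx)) / k
print(f"recall@{k} of the toy index vs. brute force: {recall:.2f}")
```

Any true neighbour that lands in a different bucket is invisible to the toy index, which is the failure mode described above. The same recall calculation should work against a real index if you swap `lsh_top_k` for its query call.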

All of this means you'll need to retrieve a multiple of K vectors, figure out some way to re-rank them to exclude the irrelevant ones, and have your own ground truth to measure the index's precision and recall.
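
A rough sketch of that loop, assuming you already have an over-fetched candidate list from the index and some re-ranker. The document IDs, scores, threshold and the `rerank_score` callable are all hypothetical; in practice the re-ranker might be a cross-encoder or an LLM judge.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of retrieved IDs against hand-labelled relevant IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def retrieve_and_rerank(candidates, rerank_score, k, min_score):
    """Re-rank over-fetched candidates, drop low scorers, keep at most k.

    rerank_score is a hypothetical callable: doc_id -> relevance score.
    """
    scored = sorted(((rerank_score(d), d) for d in candidates), reverse=True)
    return [doc_id for score, doc_id in scored if score >= min_score][:k]

# Toy data: 4*K IDs the index returned for one query, best-first by cosine.
index_results = ["d7", "d2", "d9", "d4", "d1", "d8", "d3", "d5"]
# Made-up re-ranker scores standing in for a real relevance model.
toy_scores = {"d7": 0.2, "d2": 0.9, "d9": 0.1, "d4": 0.8,
              "d1": 0.7, "d8": 0.3, "d3": 0.6, "d5": 0.05}
# Hand-labelled ground truth; d6 is relevant but the index never surfaced it.
ground_truth = {"d2", "d4", "d1", "d6"}

k = 2
baseline = index_results[:k]
reranked = retrieve_and_rerank(index_results, toy_scores.get, k=k, min_score=0.5)

print(precision_recall(baseline, ground_truth))   # (0.5, 0.25)
print(precision_recall(reranked, ground_truth))   # (1.0, 0.5)
```

Note that re-ranking can only reorder what the index returned. A relevant document the index missed (`d6` here) still caps recall, which is why you need your own ground truth to see it.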

5 comments

The vectors are literally constructed so that cosine similarity is semantic similarity.

> second unstated assumption is that the vector index can accurately identify the top K vectors by cosine similarity, and that's not true either

It's not unstated; it's called ANN (approximate nearest neighbor) for a reason.

> The vectors are literally constructed so that cosine similarity is semantic similarity.

Are they? A learned embedding doesn't guarantee this and a positional embedding certainly doesn't. Our latent embeddings don't either unless you are inferring this through the dot product in the attention mechanism. But that too is learned. There are no guarantees that the similarities that they learn are the same things we consider as similarities. High dimensional space is really weird.

And while we're at it, we should mention that methods like t-SNE and UMAP are clustering algorithms, not dimensionality reduction. Just because they can find ways to cluster the data in a lower dimensional projection (epic mapping) doesn't mean that the points are similar in the higher dimensional space. It all depends on the ability to unknot in the higher dimensional space.

It is extremely important to do what the OP is doing and consider the assumptions of the model, data, and measurements. Good results do not necessarily mean good methods. I like to say that you don't need to know math to make a good model, but you do need to know math to know why your model is wrong. Your comment just comes off as dismissive rather than actually countering the claims. There are plenty more assumptions than the OP listed, too. But those assumptions don't mean the model won't work; they just tell you what constraints the model is working under. We want to understand the constraints/assumptions if we want to make better models. Large models have advantages because they can have larger latent spaces, and that gives them a lot of freedom to unknot data and move it around as they please. But that doesn't mean the methods are efficient.

There are embeddings that are trained to reflect similarity, for example SentenceBERT, where the training process pushes pairs of similar sentences (as defined by whoever built the dataset) to have closer embeddings and dissimilar sentences to be further apart.
As the OP points out, cosine similarity doesn't always equate to relevance. As I was expanding upon, things get really messy as the dimensions increase, and your intuition about how vectors relate to one another goes out the window, fast. Distributional mass is not uniform. The rate of orthogonality increases. And of course, there are no guarantees that latent dimensions align with human-meaningful semantic features. There's no pressure to align basis vectors with human-perceived semantics. My argument isn't that there's no similarity pressure; it's that similarity in high dimensions means different things than similarity in low dimensions. For example, in high dimensions most of a unit cube's mass lies outside the unit sphere, while in 2 or 3 dimensions the unit cube is always contained inside with room to spare. High dimensions are weird, and that's what my comment is about, because many people are using their lower-dimensional intuition for ML.
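
If you want to check the cube/sphere claim yourself, here's a quick Monte Carlo sketch. It assumes the cube is centred on the origin with side length 1 and the sphere has radius 1; the sample count and the list of dimensions are arbitrary choices.

```python
import numpy as np

def fraction_of_cube_inside_ball(dim, n_samples=20_000, seed=0):
    """Estimate how much of the centred unit cube lies inside the unit ball.

    Samples points uniformly from [-0.5, 0.5]^dim and counts the share whose
    distance from the origin is at most 1.
    """
    rng = np.random.default_rng(seed)
    points = rng.uniform(-0.5, 0.5, size=(n_samples, dim))
    inside = np.linalg.norm(points, axis=1) <= 1.0
    return float(np.mean(inside))

for dim in (2, 3, 10, 20, 100):
    print(dim, fraction_of_cube_inside_ball(dim))
```

In 2 and 3 dimensions the farthest corner sits at distance sqrt(dim)/2 from the centre, which is below 1, so every sample lands inside. Past 4 dimensions the corners poke out, and the inside fraction falls toward zero as the dimension grows.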
Do you know how embedding models are trained?
Yes. My comment is about the geometry of higher dimensions and their meanings. These are not the same as in {2,3}D
To be fair… semantic similarity isn’t the same as relevance either.

They are related, and we frequently assume they are close enough that it doesn’t matter, but they are different.

I disagree. The embeddings are what the LLMs themselves use to produce relevant output, and the output is relevant; ergo, the embeddings do produce relevant output via similarity search.
You probably aren’t using an LLM for your text embeddings for document retrieval (they don’t perform as well as specialist embedding models[0]), and even if they did, you have an embedding about a bare document, without any context of what you are trying to get out of it. If you were to add your context in and then get an embedding, you would get a different answer. As your query gets specific, irrelevant aspects of the embedding space can overwhelm the similarity function, leading to irrelevant answers that are still semantically similar.

[0] https://huggingface.co/spaces/mteb/leaderboard

The recent SILO-LM paper has a slightly different approach: rather than using input embeddings and prompting the LLM with documents, it searches the database according to the LLM's output embedding and uses KNN search to skew the output embedding vector before token generation. Done that way round, using LLM embeddings outperforms RAG, allegedly.

They did it with a custom language model. I really want to give this a try with llama2 embeddings but haven't had the bandwidth yet (and llama2's embedding vectors are inconveniently huge, but that's a different problem).

Interesting! I’ll have to look into that.
Consider the extreme case: when I ask a question about X, then a page with just the questions about X will get the highest similarity. But what I want in terms of relevance for the answer is a page with a little bit about X and lots of surrounding context that answers the question. By definition the extra context will likely lower the similarity.
Not if you're using ANN. In some cases that will be very similar to exhaustive search but in other cases you'll get results that you don't want. You also need embeddings that distribute things mostly evenly across the embedding space (not all will).
That's interesting.

Are there any good sources to learn more about that?

This is kind of a moot argument: semantic similarity has higher dimensionality than cosine similarity can capture.

If I'm using vectors for question/answer, then:

"What is a cat"

and

"What is a dog"

Should be more dissimilar than the documents answering either.

If I'm using it for FAQ filtering then they should be more similar.

I've had decent results using a doc2query style approach:

    1. Ask an LLM to return a list of questions answered by the document
    2. Store the embeddings of the questions along with a document ID
    3. On user query, get the embedding of the user query
    4. KNN cosine similarity search the user embedding vs. the corpus of question embeddings
    5. Return the highest ranked documents
You can tweak this approach depending on your use case, so that in step 1 you generate embeddings that are more similar to the types of things you want returned in step 5. If you want the answer to "What is a cat" to be similar to "What is a dog," you'd prompt/finetune the LLM in step 1 to generate broad questions that would encompass both; if you want them to be very different, you'd do the opposite and avoid generalities.
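
The five steps above can be sketched roughly like this. The LLM call in step 1 is replaced by hard-coded questions, and `toy_embed` is a bag-of-words stand-in for a real embedding model, so the names, documents and dimension are all hypothetical.

```python
import numpy as np

DIM = 256
_vocab = {}  # word -> vector slot, assigned on first sight (toy only)

def toy_embed(text):
    """Stand-in for a real embedding model: unit-length bag-of-words vector."""
    vec = np.zeros(DIM)
    for word in text.lower().split():
        slot = _vocab.setdefault(word, len(_vocab))
        vec[slot % DIM] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Step 1 (hard-coded here): questions an LLM might say each document answers.
generated_questions = {
    "doc_cats": ["what is a cat", "how long do cats live"],
    "doc_dogs": ["what is a dog", "how often should i walk a dog"],
}

# Step 2: store one embedding per question alongside its document ID.
doc_ids, rows = [], []
for doc_id, questions in generated_questions.items():
    for question in questions:
        doc_ids.append(doc_id)
        rows.append(toy_embed(question))
question_matrix = np.vstack(rows)

def search(user_query, top_k=1):
    """Steps 3-5: embed the query, rank questions by cosine, return doc IDs."""
    scores = question_matrix @ toy_embed(user_query)  # rows are unit length
    ranked_docs = []
    for i in np.argsort(-scores):
        if doc_ids[i] not in ranked_docs:   # keep each document once
            ranked_docs.append(doc_ids[i])
    return ranked_docs[:top_k]

print(search("what is a cat"))   # ['doc_cats']
print(search("what is a dog"))   # ['doc_dogs']
```

With a real embedding model you'd swap out `toy_embed` and, for a large corpus, replace the brute-force matrix product with whatever index you use. How broad or narrow the step 1 questions are is what controls whether the cat and dog queries end up near each other.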
You just reinvented a two-year-old technique with a more expensive pipeline, and you're missing the performance gains from the cross-encoder step:

https://www.sbert.net/examples/domain_adaptation/README.html https://arxiv.org/abs/2112.07577

I'm aware of more efficient ways to do it! (Hence referencing e.g. doc2query.) But you have to train a model, whereas with an LLM you can get a working version in 5 minutes of work.
But with even less work you can just pick up a model that was pre-trained using GPL and get great results.

I'm able to pull messy results directly from internet sources and re-rank on the fly with a quantized e5 model small enough to fit in a serverless function.

You don't need a vector database to do all this stuff. The people who get paid when others use vector databases are the ones hyping them up the most.

Yes, but calculating the cosine similarity for all the candidates is prohibitively expensive.

Hence the heuristic.

Switching to Word2Vec embeddings led to a substantial improvement in my cosine similarity evaluations for text similarity, but granted I was looking for actual similarity, not relevance. I tried many different methods and had lots of mediocre results initially.

code: https://github.com/jimmc414/document_intelligence/blob/main/... https://github.com/jimmc414/document_intelligence

Interesting, do you happen to have some quantitative results on this/additional insights/etc?

I've interpreted transformer vector similarity as 'likelihood to be followed by the same thing' which is close to word2vec's 'sum of likelihoods of all words to be replaced by the other set' (kinda), but also very different in some contexts.

There's no simplified definition like that, vectors can even capture logical properties, it's all down to what the model was tuned for: https://www.sbert.net/examples/training/nli/README.html
This is very interesting. You had better results here than with OpenAI's ada-002 and other embeddings like bge?
As opposed to sentencebert or what?
DistilBERT and RoBERTa
Could you please explain your 2nd paragraph a bit? I couldn’t quite understand either the problem statement or the reasoning itself.
"Cosine similarity != relevance" In all ML search products, there's a tradeoff between precision and recall, and moreover there's almost never any "gold" data that ensures the "correctness" of surfaced results. I mean, Bing and Google have both invested millions of dollars in labeling web pages and even evaluating search results, but those labels can become useless as your set of documents change.

Cosine similarity is a useful compromise, and yes, a lot of authors take this for granted. At the end of the day, an LLM product probably won't be evaluated on accuracy but rather on "lift" over an alternative. And the evaluation will be in units of user happiness.

> All of this means you'll need to retrieve a multiple of K vectors, figure out some way to re-rank them to exclude the irrelevant ones, and have your own ground truth to measure the index's precision and recall.

This is usually a Series E problem, not a Series A problem.

Azure Cognitive Search takes care of all of this by combining semantic search with other layers of traditional search methods.