Hacker News
by BoorishBears 1022 days ago
This is kind of a moot argument: semantic similarity has more dimensions than a single cosine similarity score can capture.

If I'm using vectors for question/answer, then:

"What is a cat"

and

"What is a dog"

Should be more dissimilar than the documents answering either.

If I'm using it for FAQ filtering then they should be more similar.

I've had decent results using a doc2query style approach:

    1. Ask an LLM to return a list of questions answered by the document
    2. Store the embeddings of the questions along with a document ID
    3. On user query, get the embedding of the user query
    4. KNN cosine similarity search the user embedding vs. the corpus of question embeddings
    5. Return the highest ranked documents
You can tweak this approach depending on your use case, so that in step 1 you generate embeddings that are more similar to the types of things you want returned in step 5. If you want the answer to "What is a cat" to be similar to "What is a dog," you'd prompt/finetune the LLM in step 1 to generate broad questions that would encompass both; if you want them to be very different, you'd do the opposite and avoid generalities.
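
A minimal sketch of steps 2 through 5, under some loud assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, the questions in `questions_by_doc` are hand-written placeholders for what the LLM would return in step 1, and all function names here are hypothetical rather than from any library.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in embedding (assumption): bag-of-words counts, not a real model."""
    return Counter(text.lower().replace("?", "").split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(questions_by_doc, embed=toy_embed):
    """Step 2: store each question embedding alongside its document ID."""
    return [(embed(q), doc_id) for doc_id, qs in questions_by_doc.items() for q in qs]

def search(query, index, k=2, embed=toy_embed):
    """Steps 3-5: embed the query, rank questions by cosine, return top-k distinct doc IDs."""
    q_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q_vec, item[0]), reverse=True)
    results = []
    for _, doc_id in ranked:
        if doc_id not in results:
            results.append(doc_id)
        if len(results) == k:
            break
    return results

# Step 1 would be an LLM call; these questions are placeholders for its output.
questions_by_doc = {
    "cat_article": ["What is a cat?", "How long do cats live?"],
    "dog_article": ["What is a dog?", "How often should I walk a dog?"],
}
index = build_index(questions_by_doc)
print(search("what is a cat", index, k=1))
```

Swapping `toy_embed` for a real embedding call is the only change needed to try this on actual data; how broad or narrow the generated questions are is what controls whether the cat and dog documents land near each other.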
You just reinvented a 2-year-old technique with a more expensive pipeline, and missed the performance gains from the cross-encoder step:

https://www.sbert.net/examples/domain_adaptation/README.html https://arxiv.org/abs/2112.07577

I'm aware of more efficient ways to do it! (Hence referencing e.g. doc2query.) But you have to train a model, whereas with an LLM you can get a working version in 5mins of work.
But with even less work you can just pick up a model that was pre-trained using GPL and get great results.

I'm able to pull messy results directly from internet sources and re-rank on the fly with a quantized e5 model small enough to fit in a serverless function.

You don't need a vector database to do all this stuff; the people who get paid when others use vector databases are the ones hyping them up the most.

Oh, I wasn't suggesting using a vector DB. Personally I just iterate through the corpus and check cosine similarity with a for loop.
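
Roughly what that for loop looks like, as a sketch: the vectors are made-up 3-dimensional toys rather than real embeddings, and `top_k` is a hypothetical helper name.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as matching nothing
    return dot / (norm_a * norm_b)

def top_k(query_vec, corpus, k=3):
    """Brute-force scan: score every (doc_id, vector) pair and keep the best k."""
    scored = []
    for doc_id, vec in corpus:
        scored.append((cosine_similarity(query_vec, vec), doc_id))
    scored.sort(reverse=True)
    return scored[:k]

# Toy vectors (assumption); real embeddings would have hundreds of dimensions.
corpus = [
    ("doc_a", [1.0, 0.0, 0.0]),
    ("doc_b", [0.7, 0.7, 0.0]),
    ("doc_c", [0.0, 0.0, 1.0]),
]
best = top_k([1.0, 0.1, 0.0], corpus, k=2)
for score, doc_id in best:
    print(doc_id, round(score, 3))
```

The scan is linear in corpus size, which is why it stays practical for small corpora without any index.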

If by "quantized e5 model small enough to fit in a serverless function" you mean e5-small-v2, FYI it actually underperforms just calling OpenAI for embeddings (text-embedding-ada-002) on the HuggingFace MTEB benchmarks. And that definitely doesn't negate using a doc2query-style approach to preprocess the documents before running them through the pretrained embedding model if you're comparing e.g. questions to answers, rather than raw document-to-document similarity. (Of course a custom trained model will be more efficient! In fact, the original doc2query paper in 2019 used a custom trained model for step 1, as did many enhancements on it e.g. doc-t5-query. What's neat is that with the advent of really good pretrained LLMs, you can get results approximating that without training your own models in like ~5mins of work.)

I guess this really boils down to your use case: if you can have a result for your user with fully predictable latency (my biggest beef with non-Azure OpenAI), no additional round trip, and increased configurability, does MTEB performance move the needle?

Considering the LLM is still doing the final pass, and the latency from the LLM is based on output length, I find the UX to be significantly improved just doing reranking in-process.

I think there's been a bit of whiplash, where people went from gatekeeping "hard ML" to "I can shove this all at a REST API", but there's a golden path lying in between for use cases where UX matters.

I even fall back to old-school NLP (like ML-less, glorified wordlist POS taggers) for LLM tasks and end up with significantly improved performance for almost 0 additional effort.