| I agree too. My impression is that almost all RAG tutorials _only_ talk about vector DBs, even though these are not strictly required for Retrieval Augmented Generation. I'm guessing vector DBs are useful when you have massive amounts of documents on diverse topics. Some gotchas I experienced (but I might be using the wrong embedding/vector DB: spaCy/FAISS): - Short user questions might result in a low-signal query vector, e.g. user: "Who is Keanu Reeves?" -> false positives on Wikipedia articles which only contain "Who is" - Typos and formatting affect the vectorization, and a small difference might lead to a miss, e.g. "Who is Keanu Reeves?" -> match, "Who is keanu Reeves?" -> no match, and no match with any other capitalization either. If there's only a single document, a simple keyword search might lead to better results. In my experience, false positives (retrieving an irrelevant text and generating a completely wrong answer) are a bigger problem than false negatives (not retrieving the text, so possibly not being able to answer the question). Does anybody have experience with Apache Lucene / Solr or Elasticsearch? |
I've been working on a RAG system with Solr, and quickly hit some of the issues you describe when dealing with real-world messy data and user input. For example, using all-MiniLM-L6-v2 and cosine similarity, "Can you summarize Immanuel Kant's biography?" matched a chunk containing just the word "Biography" rather than one which started "Immanuel Kant, born in 1724...", and "How high is Ben Nevis?" matched a chunk of text about someone called Benjamin rather than a chunk about mountains containing the words "Ben Nevis" and its height[0]. Switching embedding model has helped, but I'm still not convinced that vector search alone is the silver bullet some claim it is. There's still lots more to try though, e.g. hybrid search[1], query expansion[2], knowledge graphs etc.
[0] https://www.michael-lewis.com/posts/vector-search-and-retrie...
[1] https://sease.io/2023/12/hybrid-search-with-apache-solr.html
[2] https://news.ycombinator.com/item?id=38706913
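To make the hybrid search idea a bit more concrete, here is a minimal sketch of one common way to merge a keyword ranking with a vector ranking: reciprocal rank fusion (RRF). This is an illustration only, not necessarily what the Solr article in [1] does. The function name, the document ids, and both result lists are made up for the example, and k=60 is just a conventional default.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list, best first.

    Each document scores 1 / (k + rank) for every list it appears in,
    so a document ranked well by both searches beats one ranked well by only one.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for "How high is Ben Nevis?" (ids invented for illustration).
keyword_hits = ["ben_nevis", "mountains_uk", "benjamin_bio"]  # e.g. from a keyword/BM25 query
vector_hits = ["benjamin_bio", "ben_nevis", "scotland"]       # e.g. from a vector/kNN query

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

With these made-up lists, the chunk that both searches rank highly ends up ahead of the vector-only false positive. Whether that holds on real data depends on how good each individual ranking is.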