The company I work for has tons of documentation and regulations covering several areas. In some areas there are well over a thousand documents, and to make them easier to use we build RAG-based chatbots. This is why I have been playing with RAG systems on a scale from "build completely from scratch" to "connect the services in Azure".

The retrieval part of a RAG system is vital for good, reliable answers, and if you build it naively the results are underwhelming. You can improve the retrieved documents in many ways, for example by:

- better chunking,
- better embedding,
- embedding several rephrased versions of the query,
- embedding a hypothetical answer to the prompt,
- hybrid retrieval (vector similarity + keyword/TF-IDF/BM25 search),
- making heavy use of metadata,
- introducing additional (or hierarchical) summaries of the documents,
- returning not only the chunks but also adjacent text,
- re-ranking the candidate documents,
- fine-tuning the LLM

and much, much more.

However, at the end of the day a RAG system usually still has a hard time answering questions that require an overview of your data. Example questions are:

- "What are the key differences between the new and the old version of document X?"
- "Which documents can I ask you questions about?"
- "How do the regulations differ between case A and case B?"

In these cases it is really helpful to use LLMs to decide how to process the prompt. This can be something simple like query routing, or rephrasing/enhancing the original prompt until something useful comes up. But it can also be agents that come up with sub-queries and a plan for combining the partial answers. You can also build a network of agents with different roles (coordinator/planner, reviewer, retriever, ...) to come up with an answer.

* edited the formatting
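To make the hybrid retrieval point a bit more concrete: one common way to merge a vector-search ranking with a keyword/BM25 ranking is reciprocal rank fusion (RRF). This is only a minimal sketch of that one technique, not necessarily what any particular stack does; the document ids and the two hit lists are made up for illustration, and `k=60` is just a frequently used default.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default, not a tuned value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score (descending); break ties by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from a vector search and a keyword (BM25) search.
vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Documents that show up near the top of both lists win, which is usually what you want from hybrid search. The nice part is that RRF only needs ranks, so you do not have to normalize cosine similarities against BM25 scores.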
My experience has been that agent networks are far too unpredictable to be of use.
In my testing with agent networks, it was a challenge to force them to provide a response, even an imperfect one. If there was a "reviewer" in the pool, it seemed to keep the cycle going, with no clear way of forcing it to break out.
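For what it's worth, the obvious guard against a never-satisfied reviewer is a hard cap on review rounds that returns the last draft no matter what. A rough sketch of that idea, assuming the drafter and reviewer are plain callables; `fake_draft` and `never_satisfied` are hypothetical stand-ins for the actual LLM calls, not part of any agent framework:

```python
from typing import Callable, Tuple


def answer_with_review(
    draft: Callable[[str, str], str],
    review: Callable[[str, str], Tuple[bool, str]],
    question: str,
    max_rounds: int = 3,
) -> str:
    """Run a draft/review cycle, but always return after max_rounds."""
    feedback = ""
    answer = ""
    for _ in range(max_rounds):
        answer = draft(question, feedback)
        approved, feedback = review(question, answer)
        if approved:
            break
    # Approved or not, the last draft is returned instead of looping forever.
    return answer


# Hypothetical stand-ins for the LLM-backed drafter and reviewer.
def fake_draft(question: str, feedback: str) -> str:
    return f"draft for {question!r} (feedback: {feedback or 'none'})"


def never_satisfied(question: str, answer: str) -> Tuple[bool, str]:
    return False, "needs more detail"


result = answer_with_review(fake_draft, never_satisfied, "What changed in X?", max_rounds=2)
print(result)
```

This does not make the agents any more predictable, it only bounds the damage: you get an imperfect answer after a known number of calls instead of a loop that ends when the context runs out.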
GPT-3.5 actually worked better than GPT-4 because it ran out of context sooner, which at least ended the loop.
I am certain that I could have tuned it to work, but at the end of the day it felt easier and more deterministic to do a few steps of old-fashioned data processing and then hand the data to the LLM.
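As an illustration of what I mean by old-fashioned processing, take the "differences between the old and new version of document X" question: you can compute the diff deterministically and only ask the LLM to summarize it. A minimal sketch using Python's standard `difflib`; the function name `build_diff_prompt`, the prompt wording, and the sample texts are all made up for the example:

```python
import difflib


def build_diff_prompt(old_text: str, new_text: str, doc_name: str) -> str:
    """Compute a unified diff deterministically and wrap it in a prompt."""
    diff_lines = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=f"{doc_name} (old)",
        tofile=f"{doc_name} (new)",
        lineterm="",
    )
    diff = "\n".join(diff_lines)
    return (
        f"Below is a unified diff between two versions of {doc_name}.\n"
        "Summarize the key differences in plain language.\n\n"
        f"{diff}"
    )


# Made-up sample texts standing in for two versions of a regulation.
old = "Section 1: Retention is 5 years.\nSection 2: Audits are annual."
new = "Section 1: Retention is 7 years.\nSection 2: Audits are annual."

prompt = build_diff_prompt(old, new, "document X")
print(prompt)
```

The LLM then only has to describe a diff it was handed, rather than retrieve both versions and work out what changed on its own.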