| RAG is a hack for lots of reasons, but the one I'm focused on at the moment is the pipeline. Say you are trying to do RAG in a chat-type application. You do the following: 1) Summarize the chat context into text that is suitable for a search (lossy). 2) Turn this text into a vector embedded in a particular vector space. 3) Use this vector to query a vector database, which returns references to documents or document fragments (which have themselves been indexed as lossy vectors). 4) Take the text of these fragments and put it in the context of the LLM as input. 5) Modify the prompt to explain what these fragments are. 6) Send the prompt to the LLM, which turns it into its own vector representation. An obvious improvement is that the VectorDB and the LLM should share an internal representation, and the VectorDB should understand this. The LLM should take the retrieved vectors as a second input alongside the text context and combine the two (in the same way you can put text and an image into a multi-modal model). |
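
To make the six steps concrete, here is a minimal sketch in plain Python. Everything in it is a stand-in I made up for illustration: `summarize` just keeps the last message, `embed` is a toy bag-of-words counter rather than a real embedding model, `ToyVectorDB` is an in-memory list, and no LLM is actually called. It only shows where text gets turned into vectors and back into text.

```python
import math
from collections import Counter


def summarize(chat):
    # Step 1 (lossy): hypothetical summarizer that keeps only the last message.
    return chat[-1]


def embed(text):
    # Step 2: toy bag-of-words "embedding" (assumption: a real system
    # would call an embedding model here).
    return Counter(text.lower().split())


def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorDB:
    # Hypothetical in-memory store: each document is indexed as one lossy vector.
    def __init__(self, docs):
        self.docs = docs
        self.index = [embed(d) for d in docs]

    def query(self, vec, k=1):
        # Step 3: rank documents by similarity and return the top-k as text.
        ranked = sorted(
            range(len(self.docs)),
            key=lambda i: cosine(vec, self.index[i]),
            reverse=True,
        )
        return [self.docs[i] for i in ranked[:k]]


def rag_prompt(chat, db, k=1):
    query_vec = embed(summarize(chat))      # steps 1-2: text -> vector
    fragments = db.query(query_vec, k=k)    # step 3: vector -> text again
    # Steps 4-5: paste the fragment text into the context and explain it.
    lines = ["Reference fragments retrieved for this conversation:"]
    lines += [f"- {frag}" for frag in fragments]
    lines += ["", "Conversation so far:"] + chat
    # Step 6 would send this string to the LLM, which re-encodes it into
    # its own internal vectors; query_vec and db.index are thrown away.
    return "\n".join(lines)


db = ToyVectorDB([
    "The cat sat on the mat",
    "Vector databases index embeddings",
    "Paris is the capital of France",
])
print(rag_prompt(["hi", "how do vector databases index documents"], db))
```

Note that the vectors computed in steps 2 and 3 never reach the model: only the fragment text does, and it gets embedded a second time in a different space. That round trip is what the shared-representation idea would remove.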