So this dumps the documents returned from the vector store into a prompt to the LLM. How does it work when there are many documents returned? What's the upper limit there?
Yep. We use LangChain's basic text splitter to chunk the documents and the QA chain to stuff the chunks into the prompt. But AFAIK it doesn't check for context length, so that's a piece that's still missing.
The upper limit depends on the model. Llama 2's context window is 4k tokens (4,096), and that includes the prompt.
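For what it's worth, the missing context-length check could look roughly like the sketch below. This is not what the project or LangChain does today; it is a plain-Python illustration. The names `fit_chunks` and `approx_token_count`, the 512-token answer reserve, and the "about 4 characters per token" estimate are all assumptions made up for the example. A real version would count tokens with the model's own tokenizer.

```python
from typing import Callable, List

LLAMA2_CONTEXT_TOKENS = 4096  # Llama 2 context window, prompt included


def approx_token_count(text: str) -> int:
    """Hypothetical stand-in for a real tokenizer: assumes ~4 characters per token."""
    return (len(text) + 3) // 4


def fit_chunks(
    chunks: List[str],
    prompt_overhead: int,
    reserved_for_answer: int = 512,  # assumed headroom for the generated answer
    context_limit: int = LLAMA2_CONTEXT_TOKENS,
    count_tokens: Callable[[str], int] = approx_token_count,
) -> List[str]:
    """Keep retrieved chunks, in ranked order, until the token budget runs out."""
    budget = context_limit - prompt_overhead - reserved_for_answer
    kept: List[str] = []
    used = 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            break  # stop at the first chunk that would overflow the window
        kept.append(chunk)
        used += cost
    return kept


if __name__ == "__main__":
    # Four fake retrieved chunks of roughly 1000, 1000, 2000 and 1000 tokens.
    docs = ["a" * 4000, "b" * 4000, "c" * 8000, "d" * 4000]
    selected = fit_chunks(docs, prompt_overhead=200)
    print(f"kept {len(selected)} of {len(docs)} chunks")
```

Stopping at the first chunk that doesn't fit keeps the vector store's ranking intact. Skipping the oversized chunk and trying the next one would pack more text in, but it would reorder relevance.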