by dbish
1151 days ago
It becomes a ranking problem in a sense. There is lots of data that could be the answer and lots of context that “could” be relevant to put into the context window, but you have to pick the right context and answer with the most correct information, and which context is right becomes less clear as your dataset grows.
|
The problem is that once there are 10, 20, 30 different-but-similar documents in the vectorstore (like business school case studies), asking the bot "what are the key takeaways from the airbnb case" grabs a bunch of useless embedded documents to provide as context. Yes, I can tell users how to ask better questions, but it's a bad user experience and nobody sticks with it or tries to understand why their queries don't work.
I could use hypothetical document embeddings, but a lot of the cases or course notes are proprietary or not publicly available, so I would guess that the hypothetical answers the LLM comes up with won't provide much better context.
This was built with LangChain + Pinecone.
edit: I think smarter people than I are working on a lot of better ways to do this, but I think one potential solution is to apply metadata to each document when embedding it (e.g., ask the LLM to apply any number of X preset metadata tags) and then, when retrieving context from the vectorstore, filter the results by those tags.
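
Here is a rough sketch of that tag-then-filter idea, just to make it concrete. It is not the LangChain or Pinecone API: the in-memory `DOCS` list, the bag-of-words `embed` function, and the keyword-matching `tags_for_query` helper are all made-up stand-ins for a real vectorstore, a real embedding model, and an LLM tagging step.

```python
import math
from collections import Counter


def embed(text):
    # Toy stand-in for a real embedding model: bag-of-words counts.
    return Counter(text.lower().split())


def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical documents; in practice the tags would be assigned at
# embedding time (for example by asking an LLM to pick from preset tags).
DOCS = [
    {"text": "key takeaways: trust between hosts and guests drove growth",
     "tags": {"airbnb"}},
    {"text": "key takeaways: driver supply and surge pricing drove growth",
     "tags": {"uber"}},
    {"text": "key takeaways: content spending drove subscriber growth",
     "tags": {"netflix"}},
]


def tags_for_query(query, known_tags):
    # Assumption: a plain keyword match stands in for an LLM tagging the query.
    words = set(query.lower().replace("?", "").split())
    return words & known_tags


def retrieve(query, docs, top_k=2):
    known_tags = set().union(*(d["tags"] for d in docs))
    wanted = tags_for_query(query, known_tags)
    # Filter by tag first; fall back to the full set if no tag matched.
    candidates = [d for d in docs if d["tags"] & wanted] if wanted else docs
    q = embed(query)
    ranked = sorted(candidates,
                    key=lambda d: cosine(q, embed(d["text"])),
                    reverse=True)
    return ranked[:top_k]


results = retrieve("what are the key takeaways from the airbnb case", DOCS)
unfiltered = retrieve("what drove growth", DOCS)
```

With the filter applied first, the near-identical "key takeaways" documents for other cases never reach the similarity ranking, so they can't crowd out the one the user asked about. Queries that match no tag fall back to plain similarity search over everything.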