Hacker News
by dbish 1151 days ago
It becomes a ranking problem in a sense. There is lots of data that could be the answer and lots of context that “could” be relevant to put into the context window, but you then have to pick the right context and answer with the most correct information, and that gets less clear as your dataset grows.
3 comments

This is it. One of the "apps" I built was a Slack bot for my classmates (in business school) that lets users upload docs via Slack (course notes, cases, etc.), which get embedded so you can then do QA over them in Slack. I also added lots of hard-to-find or disparate information from our school like course reviews, registration information, calendars, etc. so we could all access it from one place.

The problem is once there are 10, 20, 30 different-but-similar documents in the vectorstore (like business school case studies), then asking the bot "what are the key takeaways from the airbnb case" grabs a bunch of useless embedded documents to provide as context. Yes, I can tell users how to ask better questions, but it's a bad user experience and nobody sticks with it or tries to understand why their queries don't work.

I could use hypothetical document embeddings but the problem is a lot of the cases or course notes are proprietary or not publicly available, so I would guess that the hypothetical answers the LLM would come up with won't provide much better context.

This was built with langchain + pinecone.

edit: I think smarter people than I are working on a lot of better ways to do this, but I think one potential solution is to apply metadata to each document when embedding it (e.g., ask the LLM to apply any number of X preset metadata tags) and then, when retrieving context from the vectorstore, filtering the results by those tags.
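
To make the tagging idea concrete, here is a rough sketch in plain Python rather than langchain or pinecone. Everything in it is an assumption for illustration: the `Doc` class, the `retrieve` function, the toy 2-d embeddings, and the tag names are all made up, and the step where an LLM assigns tags is left out entirely (tags are just given).

```python
import math
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Doc:
    # Hypothetical record: text, its embedding, and the preset tags applied at embed time.
    text: str
    embedding: List[float]
    tags: Set[str] = field(default_factory=set)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_embedding, query_tags, docs, k=3):
    """Filter docs by tag overlap first, then rank the survivors by similarity."""
    if query_tags:
        # Keep only docs that share at least one tag with the query.
        candidates = [d for d in docs if d.tags & query_tags]
    else:
        # No tags for this query: fall back to searching everything.
        candidates = list(docs)
    ranked = sorted(
        candidates,
        key=lambda d: cosine(query_embedding, d.embedding),
        reverse=True,
    )
    return ranked[:k]


docs = [
    Doc("airbnb case notes", [1.0, 0.0], {"case:airbnb"}),
    Doc("uber case notes", [0.9, 0.1], {"case:uber"}),
    Doc("airbnb course review", [0.0, 1.0], {"case:airbnb", "review"}),
]
results = retrieve([1.0, 0.05], {"case:airbnb"}, docs, k=2)
```

Here the uber case is close to the query in embedding space but never reaches the context because the tag filter drops it before ranking. This sketch does not say where `query_tags` comes from; that part is still open.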

> The problem is once there are 10, 20, 30 different-but-similar documents in the vectorstore

Sounds like a de-duping problem. Maybe use vector embeddings to find near-identical documents and limit them in the context, i.e. maximize the vector distance between your context sources.
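
One common way to do this kind of thing is maximal marginal relevance (MMR), which is my suggestion here and not something the parent said they tried. A rough sketch in plain Python, assuming you already have the query and document embeddings as lists of floats; `mmr_select` and `lambda_mult` are names I picked for illustration:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_emb, doc_embs, k=3, lambda_mult=0.5):
    """Greedy MMR. Returns indices into doc_embs in the order they were picked.

    lambda_mult=1.0 ranks purely by relevance to the query;
    lower values penalize similarity to documents already picked.
    """
    selected = []
    remaining = list(range(len(doc_embs)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query_emb, doc_embs[i])
            # Highest similarity to anything already chosen (0.0 for the first pick).
            redundancy = max(
                (cosine(doc_embs[i], doc_embs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: docs 0 and 1 are near-duplicates, doc 2 is different.
doc_embs = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
query = [1.0, 0.3]
diverse = mmr_select(query, doc_embs, k=2, lambda_mult=0.5)
relevance_only = mmr_select(query, doc_embs, k=2, lambda_mult=1.0)
```

With `lambda_mult=1.0` the two near-duplicates both end up in the context; with `0.5` the second slot goes to the different document instead. Whether that trade-off helps depends on the data, and for the case-study example it only helps if the unwanted documents are near-duplicates of each other and not just similar to the query.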

> when retrieving context from the vectorstore, filtering the results by those tags

How would you determine what tags to filter by? Would you also need to rely on the LLM to say "which tags from the collection match this question"?

Thanks for sharing the detail, that's really helpful and I realize I've been facing a similar issue!
What vector store did you use? Was it an issue with the vector store, or is it that the ANN algorithm is just not good with large datasets?
I wonder if that's the point when fine-tuning becomes a more appropriate option?