by vectoral 1151 days ago
This is one of the areas of LLMs that I find most interesting. So far, I've found simple question-answering over vectorstores to be a lackluster experience. In particular, the more information you embed and stick into the vectorstore, the less useful the system becomes as you are less likely to get the information you're looking for (especially if the users don't understand that their queries need to look like the docs they want to ask about).

I haven't had a chance to try out hypothetical document embeddings yet, but I expect they only provide a marginal improvement (especially if QAing over proprietary data or information).

I'd love to see any other interesting, more up-to-date resources anyone has found on this topic. I found this recent paper interesting: https://arxiv.org/abs/2304.11062

1 comments

> In particular, the more information you embed and stick into the vectorstore, the less useful the system becomes as you are less likely to get the information you're looking for

Can you explain that? I don't follow why it would become less useful

It becomes a ranking problem, in a sense. There is lots of data that could be the answer and lots of context that “could” be relevant to put into the context window, but then you have to pick the right context and answer with the most correct information, and that choice becomes less clear as your dataset grows.

This is it. One of the "apps" I built was a Slack bot for my classmates (in business school) that lets users upload docs via Slack (course notes, cases, etc.), which get embedded so you can then QA over them in Slack. I also added lots of hard-to-find or disparate information from our school, like course reviews, registration information, calendars, etc., so we could all access it from one place.

The problem is once there are 10, 20, 30 different-but-similar documents in the vectorstore (like business school case studies), then asking the bot "what are the key takeaways from the Airbnb case" grabs a bunch of useless embedded documents to provide as context. Yes, I can tell users how to ask better questions, but it's a bad user experience and nobody sticks with it or tries to understand why their queries don't work.

I could use hypothetical document embeddings but the problem is a lot of the cases or course notes are proprietary or not publicly available, so I would guess that the hypothetical answers the LLM would come up with won't provide much better context.
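
For anyone unfamiliar, the hypothetical document embeddings flow can be sketched roughly as below. This is a toy illustration under stated assumptions, not the langchain implementation: `fake_llm`, `embed`, `hyde_search`, and the two-document corpus are made-up stand-ins, and bag-of-words counts replace a real embedding model.

```python
import math
from collections import Counter


def embed(text):
    # Toy stand-in for a real embedding model: lowercase bag-of-words counts.
    return Counter(text.lower().split())


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def hyde_search(question, generate, corpus, k=1):
    # 1. Ask the LLM to write a passage that *would* answer the question.
    hypothetical = generate(question)
    # 2. Embed that passage instead of the raw question.
    query_vec = embed(hypothetical)
    # 3. Rank the real documents by similarity to the hypothetical passage.
    ranked = sorted(corpus, key=lambda doc: cosine(query_vec, embed(doc)), reverse=True)
    return ranked[:k]


def fake_llm(question):
    # Hypothetical stub; a real system would call an actual LLM here.
    return "airbnb grew by building trust between hosts and guests"


corpus = [
    "registration opens in march for spring courses",
    "the case shows how trust between hosts and guests let airbnb scale",
]
top = hyde_search("what are the key takeaways from the airbnb case", fake_llm, corpus)
print(top)
```

The retrieval only helps when the generated passage overlaps with the real documents. If the LLM has never seen the proprietary material, the hypothetical passage may share little with it, which is the concern here.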

This was built with LangChain + Pinecone.

edit: I think smarter people than I are working on a lot of better ways to do this, but I think one potential solution is to apply metadata to each document when embedding it (e.g., ask the LLM to apply any number of X preset metadata tags) and then, when retrieving context from the vectorstore, filtering the results by those tags.
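
A minimal sketch of what that tag-then-retrieve idea could look like, assuming the tags have already been assigned at embedding time. The store layout, the tag names, the three-dimensional vectors, and `filtered_search` are all hypothetical and chosen for illustration; this does not use Pinecone's own metadata filter API.

```python
import math


def cosine(a, b):
    # Cosine similarity between two dense vectors of equal length.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(query_vec, query_tags, store, k=3):
    # Keep only documents that share at least one tag with the query,
    # then rank the survivors by cosine similarity.
    candidates = [d for d in store if query_tags & d["tags"]]
    ranked = sorted(candidates, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return ranked[:k]


# Made-up documents: two near-identical case studies and one unrelated doc.
store = [
    {"id": "airbnb_case", "tags": {"case", "airbnb"}, "vec": [0.90, 0.10, 0.0]},
    {"id": "uber_case", "tags": {"case", "uber"}, "vec": [0.88, 0.12, 0.0]},
    {"id": "course_calendar", "tags": {"calendar"}, "vec": [0.10, 0.90, 0.1]},
]

hits = filtered_search([0.89, 0.11, 0.0], {"airbnb"}, store)
print([d["id"] for d in hits])
```

The two case studies have almost the same vector, so similarity alone can barely tell them apart; the tag filter removes the wrong one before ranking even starts.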

> The problem is once there are 10, 20, 30 different-but-similar documents in the vectorstore

Sounds like a de-duping problem. Maybe use vector embeddings to find near-identical documents and limit them in the context, i.e., maximize the vector distance between your context sources.
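
One existing technique along these lines is maximal marginal relevance (MMR), which greedily picks documents that are relevant to the query but dissimilar to what has already been picked. A rough sketch, assuming made-up two-dimensional vectors and a hypothetical `mmr` helper with an arbitrary trade-off weight `lam`:

```python
import math


def cosine(a, b):
    # Cosine similarity between two dense vectors of equal length.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, docs, k=2, lam=0.3):
    # Greedy selection: reward similarity to the query, penalize similarity
    # to the most similar document that is already selected.
    selected = []
    remaining = list(docs)
    while remaining and len(selected) < k:

        def score(d):
            relevance = cosine(query_vec, d["vec"])
            redundancy = max(
                (cosine(d["vec"], s["vec"]) for s in selected), default=0.0
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Made-up documents: an exact duplicate pair plus one different document.
docs = [
    {"id": "notes_v1", "vec": [1.0, 0.0]},
    {"id": "notes_v1_copy", "vec": [1.0, 0.0]},
    {"id": "other_topic", "vec": [0.6, 0.8]},
]

picked = mmr([1.0, 0.0], docs, k=2)
print([d["id"] for d in picked])
```

Plain top-2 by similarity would return both copies of the notes; here the second slot goes to the different document instead. Note that this spreads the context out rather than making it more correct, so it helps with duplicates but not with picking the right case study among similar ones.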

> when retrieving context from the vectorstore, filtering the results by those tags

How would you determine what tags to filter by? Would you also need to rely on the LLM to say "which tags from the collection match this question"?

Thanks for sharing the detail, that's really helpful and I realize I've been facing a similar issue!
What vector store did you use? Was it an issue with the vector store itself, or is the ANN algorithm just not good with large datasets?
I wonder if that's the point when fine-tuning becomes a more appropriate option?