by EngineeringStuf 482 days ago
Am I correct in reading that the RAG pipeline runs in realtime in response to a user query?

If so, then I would suggest running it ahead of time and having the LLM generate possible questions from the context of each semantically split chunk.

That way you only need to compare embeddings at query time, and the results will already be pre-sorted and ranked.
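
To make the offline/online split concrete, here is a minimal sketch. It assumes a toy bag-of-words `embed` function standing in for a real embedding model, and the chunk IDs and questions are made up for illustration (in practice the LLM would generate them per chunk). It is not the pipeline from the original post, just one way the idea could look:

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().replace("?", "").split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(chunk_questions):
    """Offline step: embed every pre-generated question once, keeping its chunk ID."""
    return [
        (embed(question), chunk_id)
        for chunk_id, questions in chunk_questions.items()
        for question in questions
    ]


def lookup(index, query, top_k=1):
    """Query-time step: embed the query and rank the stored questions against it."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [chunk_id for _, chunk_id in ranked[:top_k]]


# Hypothetical data: an LLM would generate these questions for each chunk.
chunk_questions = {
    "chunk-1": ["How do I reset my password?", "Where can I change my password?"],
    "chunk-2": ["What payment methods are accepted?", "Can I pay by invoice?"],
}
index = build_index(chunk_questions)
best = lookup(index, "how can I reset a password")
```

The only work left at query time is one embedding call plus the similarity comparison; everything LLM-heavy happens in `build_index` ahead of time.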

The trick, of course, is chunking it correctly and generating the right questions. But in both cases I would look to the LLM to do that.

Happy to recommend some tips on semantically splitting documents using the LLM with really low token usage if you're interested.

3 comments

> Happy to recommend some tips on semantically splitting documents using the LLM with really low token usage if you're interested.

Go on please :)

So if the user submitted a question not already generated, would that be like a cache miss, and would it instead fall back to a real-time query?

Yes, but you could optimise the generated questions over time to reduce cache misses.
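
A rough sketch of that cache-miss behaviour, under a few assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, the `0.6` threshold is arbitrary, and `realtime_fallback` is a hypothetical callable representing whatever the real-time pipeline does. Missed queries are logged so the generated questions can be improved later:

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().replace("?", "").split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def answer(query, index, realtime_fallback, miss_log, threshold=0.6):
    """Return the precomputed chunk on a hit, else fall back to the real-time path."""
    q = embed(query)
    best_score, best_chunk = max(
        ((cosine(q, vec), chunk_id) for vec, chunk_id in index),
        default=(0.0, None),
    )
    if best_score >= threshold:
        return "hit", best_chunk
    # Cache miss: remember the query so new questions can be generated for it later.
    miss_log.append(query)
    return "miss", realtime_fallback(query)


# Hypothetical index with a single pre-generated question.
index = [(embed("How do I reset my password?"), "chunk-1")]
misses = []


def fallback(query):
    """Placeholder for the real-time RAG query."""
    return "realtime:" + query


hit = answer("how do I reset my password", index, fallback, misses)
miss = answer("what is the refund policy", index, fallback, misses)
```

Feeding `misses` back into the offline question-generation step is one way to reduce the miss rate over time.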
> time and generate possible questions from the LLM based on the context of the current semantically split chunk.

Possible, but very compute-intensive. Imagine if you have hundreds of thousands of chunks...

The number of chunks would be the same regardless of either approach.

The generation of questions can be done out-of-band by a cheaper model.

Their current implementation seems to require some computation per request. It comes down to weighing which strategy provides the most value.

Responses would be faster overall.