HN Mirror

by EngineeringStuf 482 days ago

Am I correct in reading that the RAG pipeline runs in real time in response to each user query?

If so, then I would suggest that you run it ahead of time and generate possible questions from the LLM based on the context of the current semantically split chunk.

That way you only need to compare the embeddings at query time and it will already be pre-sorted and ranked.
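
A minimal sketch of that idea, under loud assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, and `generate_questions` is a hypothetical callable standing in for the LLM (here it just returns canned questions). None of these names come from the project being discussed. The point is only that the LLM work happens in `build_index`, while `search` does nothing but compare embeddings.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a real embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(chunks, generate_questions):
    """Offline step: embed every generated question and remember its chunk."""
    index = []
    for chunk_id, chunk in enumerate(chunks):
        for question in generate_questions(chunk):
            index.append((embed(question), question, chunk_id))
    return index


def search(index, query, top_k=3):
    """Query-time step: embedding comparisons only, no LLM call."""
    q = embed(query)
    scored = [(cosine(q, vec), question, chunk_id) for vec, question, chunk_id in index]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


# Hypothetical data: two chunks and the questions an LLM might have produced.
CHUNKS = [
    "Refunds are issued within 14 days of purchase.",
    "Passwords can be reset from the account settings page.",
]
CANNED = {
    CHUNKS[0]: ["How long do refunds take?", "Can I get a refund after purchase?"],
    CHUNKS[1]: ["How do I reset my password?", "Where are the account settings?"],
}

index = build_index(CHUNKS, lambda chunk: CANNED[chunk])
best = search(index, "how do i reset my password", top_k=1)[0]
```

In a real setup the index would presumably live in a vector store rather than a Python list, but the split between offline and query-time work stays the same.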

The trick, of course, is chunking it correctly and generating the right questions. But in both cases I would look to the LLM to do that.

Happy to recommend some tips on semantically splitting documents using the LLM with really low token usage if you're interested.

3 comments

TechDebtDevin 482 days ago

> Happy to recommend some tips on semantically splitting documents using the LLM with really low token usage if you're interested.

Go on please :)

triyambakam 482 days ago

So if the user submitted a question that wasn't already generated, would that be like a cache miss that falls back to a real-time query?

EngineeringStuf 482 days ago

Yes, but you could optimise the generated questions over time to reduce cache misses.
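
Roughly, the hit/miss logic could look like the sketch below. Everything here is an assumption for illustration: the `QuestionCache` class, the `0.8` threshold, the toy `embed` function (a stand-in for a real embedding model), and the `realtime_pipeline` callable representing the existing per-request path. Misses are recorded so they can later be turned into new generated questions.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a real embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class QuestionCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed cutoff; would need tuning on real data
        self.entries = []           # (question embedding, chunk_id)
        self.misses = []            # queries to mine for new questions later

    def add(self, question, chunk_id):
        self.entries.append((embed(question), chunk_id))

    def lookup(self, query, realtime_pipeline):
        """Return ("hit", chunk_id) or ("miss", result of the real-time path)."""
        q = embed(query)
        score, chunk_id = max(
            ((cosine(q, vec), cid) for vec, cid in self.entries),
            default=(0.0, None),
        )
        if score >= self.threshold:
            return "hit", chunk_id
        self.misses.append(query)  # feed these back into question generation
        return "miss", realtime_pipeline(query)


cache = QuestionCache()
cache.add("How do I reset my password?", chunk_id=1)

hit = cache.lookup("how do I reset my password", lambda q: "fallback: " + q)
miss = cache.lookup("what is the shipping cost", lambda q: "fallback: " + q)
```

Periodically generating questions from `cache.misses` is one way to do the "optimise over time" part.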

ekianjo 482 days ago

> time and generate possible questions from the LLM based on the context of the current semantically split chunk.

Possible, but very compute-intensive. Imagine if you have hundreds of thousands of chunks...

EngineeringStuf 482 days ago

The number of chunks would be the same with either approach.

The generation of questions can be done out-of-band by a cheaper model.

Their current implementation approach seems to require some computation per request. It would be a balance to see which strategy provides the most value.

Responses would be faster overall.
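
To put the "balance" in rough numbers, here is a back-of-the-envelope sketch. The function name and all parameters are hypothetical, and the costs are in arbitrary units rather than measured figures. It assumes one out-of-band generation call per chunk and asks how many requests it takes before the one-off offline cost is paid back by cheaper lookups.

```python
def break_even_requests(num_chunks, offline_cost_per_chunk,
                        realtime_cost_per_request, lookup_cost_per_request):
    """Requests needed before offline question generation pays for itself.

    Returns None when a precomputed lookup is not cheaper than the
    real-time path, since the offline cost is then never recovered.
    """
    saving_per_request = realtime_cost_per_request - lookup_cost_per_request
    if saving_per_request <= 0:
        return None
    offline_total = num_chunks * offline_cost_per_chunk
    return offline_total / saving_per_request


# Hypothetical figures in arbitrary cost units, not measurements.
requests_needed = break_even_requests(
    num_chunks=200_000,
    offline_cost_per_chunk=1,
    realtime_cost_per_request=5,
    lookup_cost_per_request=1,
)
```

With these made-up inputs the offline pass pays off after 50,000 requests; a cheaper generation model lowers `offline_cost_per_chunk` and pulls that number down.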
