
“Excellent query, good sir! <said slowly enough to let the LLM catch up>…”

More seriously, it seems like the LLM could be used to pre-generate lots of filler prefixes corresponding to the RAG’d documents being sent to the model.
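A rough sketch of what that could look like, assuming an async LLM client. `llm_complete`, `prefix_cache`, `precompute_prefix`, and `answer` are hypothetical names, and the stub only simulates latency. The filler is generated once per document at indexing time, handed to the user immediately at query time, and also passed into the real prompt so the model continues from it instead of contradicting it. Returning a list of chunks stands in for real streaming.

```python
import asyncio
import hashlib

# Hypothetical stand-in for a real LLM call; swap in your own client.
async def llm_complete(prompt: str) -> str:
    await asyncio.sleep(0.01)  # simulate model latency
    return f"[completion for: {prompt[:30]}]"

# Hypothetical cache of filler prefixes, keyed by a hash of each RAG document.
prefix_cache: dict[str, str] = {}

def doc_key(doc: str) -> str:
    return hashlib.sha256(doc.encode("utf-8")).hexdigest()

async def precompute_prefix(doc: str) -> None:
    # Run once when the document is indexed, off the request path.
    prompt = f"Write a short, generic opening line for an answer about:\n{doc}"
    prefix_cache[doc_key(doc)] = await llm_complete(prompt)

async def answer(query: str, doc: str) -> list[str]:
    # Look up the precomputed filler (empty string if none exists yet).
    filler = prefix_cache.get(doc_key(doc), "")
    # Start the real generation, asking the model to continue from the filler...
    task = asyncio.create_task(
        llm_complete(f"Context:\n{doc}\n\nQuestion: {query}\n\nAnswer: {filler}")
    )
    # ...and hand the filler to the user right away while the model catches up.
    chunks = [filler] if filler else []
    chunks.append(await task)
    return chunks
```

The point of the design is that the expensive call is already in flight before anything is shown, so the filler costs nothing on the latency path.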

While it wouldn’t help if you’re GPU-bound, multiple prompts could be run in parallel with different pieces of context, and the model could then choose the most appropriate response (which could be done in parallel too).
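A minimal sketch of the fan-out-then-select idea, assuming an async client. `llm_complete` and `llm_score` are hypothetical stubs: the "judge" here just counts query words in each candidate, where a real system would ask the model to rate relevance. Both the generation step and the scoring step run concurrently with `asyncio.gather`.

```python
import asyncio

# Hypothetical stand-in for a real generation call.
async def llm_complete(prompt: str) -> str:
    await asyncio.sleep(0.01)  # simulate generation latency
    context = prompt.split("Context:\n", 1)[1].split("\n\n", 1)[0]
    return f"Answer using: {context}"

# Hypothetical judge: scores a candidate by overlap with the query's words.
async def llm_score(query: str, candidate: str) -> float:
    await asyncio.sleep(0.01)  # simulate judging latency
    query_words = set(query.lower().split())
    return float(len(query_words & set(candidate.lower().split())))

async def best_answer(query: str, contexts: list[str]) -> str:
    # Fan out: one prompt per context chunk, all in flight at once.
    candidates = await asyncio.gather(
        *(llm_complete(f"Context:\n{c}\n\nQuestion: {query}") for c in contexts)
    )
    # Score every candidate concurrently, then keep the highest-scoring one.
    scores = await asyncio.gather(*(llm_score(query, a) for a in candidates))
    best_index = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_index]
```

Total latency is roughly one generation plus one scoring round rather than one per context chunk, which is why this only pays off when you have spare GPU capacity to run the extra prompts.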