by pw378 731 days ago
That lag between query and response ruins it for me.
2 comments

“Excellent query good sir! <said slowly enough to let the LLM catch up>…”

And more seriously, it seems like the LLM could be used to pre-create lots of filler prefixes that correspond to the RAG’d documents being sent to the model.
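
Something like this, as a rough sketch only. It assumes the filler lines were generated offline and cached per document; `FILLER_BY_DOC`, `slow_answer`, `respond`, and the document IDs are all made-up names for illustration, not any real API:

```python
import asyncio

# Hypothetical cache: one filler prefix precomputed offline per retrievable document.
FILLER_BY_DOC = {
    "billing-faq": "Good question about billing, let me check the details...",
    "setup-guide": "Sure, let me pull up the setup steps...",
}
DEFAULT_FILLER = "Let me think about that..."


async def slow_answer(query: str, doc_id: str) -> str:
    """Stand-in for the real LLM call (assumed to be the slow part)."""
    await asyncio.sleep(0.05)
    return f"[answer to {query!r} using {doc_id}]"


async def respond(query: str, doc_id: str, emit) -> None:
    # Start the slow model call first so it runs while the filler plays.
    task = asyncio.create_task(slow_answer(query, doc_id))
    # Emit the cached filler immediately to cover the wait.
    emit(FILLER_BY_DOC.get(doc_id, DEFAULT_FILLER))
    # Emit the real answer once it is ready.
    emit(await task)
```

Calling `asyncio.run(respond("refund?", "billing-faq", print))` would print the filler right away and the answer once the stubbed call finishes.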

While it wouldn’t work if you’re GPU-bound, multiple prompts could be run in parallel with different pieces of context, and then the model could choose the most appropriate response (which could be done in parallel too).
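
Roughly what I have in mind, as a sketch under assumptions: `generate` and `score` are hypothetical stand-ins for model calls (the scoring here is a placeholder that just prefers longer text so the example runs), and `best_of` is a name I made up:

```python
import asyncio


async def generate(query: str, context: str) -> str:
    """Hypothetical stand-in for one LLM call with one piece of context."""
    await asyncio.sleep(0.01)
    return f"{query} -> answer from {context}"


async def score(query: str, candidate: str) -> float:
    """Hypothetical stand-in for asking the model to rate one candidate.
    Placeholder logic: longer candidates score higher."""
    await asyncio.sleep(0.01)
    return float(len(candidate))


async def best_of(query: str, contexts: list[str]) -> str:
    # Fan out: one generation per piece of context, all in flight at once.
    candidates = await asyncio.gather(*(generate(query, c) for c in contexts))
    # Score every candidate in parallel as well.
    scores = await asyncio.gather(*(score(query, c) for c in candidates))
    # Keep the candidate with the highest score.
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]
```

With real model calls, the total wait is roughly one generation plus one scoring pass instead of N of each, as long as the backend can actually serve the requests concurrently.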

For me, it was the cuts between each call haha