by kherud
748 days ago
I think this comment explains it: https://github.com/ggerganov/llama.cpp/discussions/4130#disc...
As far as I understand (mcharytoniuk is better placed to confirm this), llama.cpp lets you split an LLM's context window into independent blocks, so that multiple requests can be processed in a single inference pass. Because LLMs are auto-regressive and generate one token per step, I think you also don't have to wait for every sequence to finish before returning any output. As soon as one sequence finishes, its "slot" in the context window can be reused for another request.
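To make the slot idea concrete, here is a toy simulation in Python. It is only a sketch of the scheduling behaviour as I understand it, not llama.cpp's actual code: `run_slots`, the request names, and the token counts are all made up for illustration, and no real model is involved.

```python
from collections import deque


def run_slots(requests, n_slots):
    """Toy simulation of slot-based batching (hypothetical, not llama.cpp's API).

    requests: list of (name, tokens_to_generate) pairs.
    Returns (step, name) pairs in the order the requests finish.
    """
    if n_slots < 1:
        raise ValueError("need at least one slot")

    pending = deque(requests)
    slots = [None] * n_slots  # each slot holds [name, remaining_tokens] or None
    finished = []
    step = 0

    while pending or any(slot is not None for slot in slots):
        # Hand every free slot to a waiting request before the next step.
        for i in range(n_slots):
            if slots[i] is None and pending:
                name, n_tokens = pending.popleft()
                slots[i] = [name, n_tokens]

        step += 1
        # One batched decode step: every active sequence emits one token.
        for i, slot in enumerate(slots):
            if slot is None:
                continue
            slot[1] -= 1
            if slot[1] <= 0:
                finished.append((step, slot[0]))
                slots[i] = None  # the slot can be reused on the next step

    return finished
```

With two slots and the requests `[("a", 3), ("b", 1), ("c", 2)]`, request `b` finishes after the first step, `c` takes over its slot immediately, and both `a` and `c` finish at step 3. If `c` had to wait for the whole first batch to complete, it would not finish until step 5.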