| HN Mirror



	by kherud 748 days ago
	I think this comment explains it: https://github.com/ggerganov/llama.cpp/discussions/4130#disc... As far as I understand (mcharytoniuk is better placed to confirm this), llama.cpp lets you split an LLM's context window into independent blocks, so that multiple requests can be processed in a single inference pass. I think that, because LLMs are auto-regressive, you also don't have to wait for all sequences to finish before returning their output. As soon as one sequence finishes, its "slot" in the context window can be reused for another request.

1 comment

mcharytoniuk 747 days ago

Yes, exactly. You can split the available context into "slots" (chunks) so that llama.cpp can handle multiple requests concurrently. The number of slots is configurable.
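
To make the slot idea concrete, here is a rough toy simulation in Python. It is not llama.cpp code and does not call any llama.cpp API: `Request`, `simulate`, `n_ctx` and `n_slots` are hypothetical names, the even split of the context across slots is an assumption of the sketch, and "one step" just stands in for one batched inference pass that yields one token per active sequence.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    tokens_to_generate: int  # stand-in for "the sequence runs until it finishes"
    generated: int = 0


def simulate(requests, n_ctx=2048, n_slots=2):
    """Toy model: each step advances every busy slot by one token."""
    slot_ctx = n_ctx // n_slots   # assumption: context is split evenly
    slots = [None] * n_slots      # None means the slot is free
    queue = deque(requests)
    finished = []                 # (step, request name) in completion order
    step = 0
    while queue or any(s is not None for s in slots):
        # Hand free slots to waiting requests.
        for i in range(n_slots):
            if slots[i] is None and queue:
                slots[i] = queue.popleft()
        step += 1
        # One batched step: every active sequence gets one more token.
        for i, req in enumerate(slots):
            if req is None:
                continue
            req.generated += 1
            # A sequence ends when it is done or its slot's context is full.
            if req.generated >= min(req.tokens_to_generate, slot_ctx):
                finished.append((step, req.name))
                slots[i] = None   # the slot can be reused on the next step
    return finished


if __name__ == "__main__":
    reqs = [Request("A", 3), Request("B", 1), Request("C", 2)]
    print(simulate(reqs, n_slots=2))  # [(1, 'B'), (3, 'A'), (3, 'C')]
```

With two slots, B finishes after the first step and C takes over its slot while A is still generating, so all three requests are done after 3 steps. With a single slot the same requests would take 6 steps, since each one has to wait for the previous one to finish.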