| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by cataflutter 2 hours ago
	Some of the inference engines can process multiple requests in parallel more efficiently than doing them sequentially. Not sure of the exact mechanism but e.g. llama.cpp's llama-server can do this (you tell it the number of slots to have when starting, then fire HTTP requests at it and it batches them together when it can). Waiting for the hooman (or tool calls) won't help either, of course.