| HN Mirror


	by wskwon 1099 days ago
	Not really. vLLM optimizes the throughput of your LLM, but it does not reduce the minimum amount of resources required to run your model.

1 comment

But, in theory, couldn't llama.cpp implement a similar approach to paging/memory management and see a speedup for 4-bit models on CPU?
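
For anyone following along, here is a rough sketch of what "paging" the KV cache means and why it helps throughput rather than the minimum footprint. This is a toy illustration, not vLLM's or llama.cpp's actual code: the block size of 16 tokens and the names `BlockAllocator`, `contiguous_slots`, and `paged_slots` are all made up for the example. It only counts token slots in the KV cache; the model weights are a fixed cost that paging does not touch.

```python
import math

BLOCK_TOKENS = 16  # tokens per KV block (illustrative value, an assumption)


def contiguous_slots(seq_lens, max_len):
    """Token slots reserved when every sequence preallocates max_len."""
    return len(seq_lens) * max_len


def paged_slots(seq_lens, block_tokens=BLOCK_TOKENS):
    """Token slots reserved when each sequence holds only the blocks it uses."""
    return sum(math.ceil(n / block_tokens) * block_tokens for n in seq_lens)


class BlockAllocator:
    """Toy block table: maps each sequence id to a list of physical block ids."""

    def __init__(self, num_blocks, block_tokens=BLOCK_TOKENS):
        self.free = list(range(num_blocks))
        self.block_tokens = block_tokens
        self.tables = {}   # seq_id -> list of block ids
        self.lengths = {}  # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        # Grab a new block only when the current one is full (or none exists).
        if n % self.block_tokens == 0:
            if not self.free:
                raise MemoryError("out of KV blocks")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        # Finished sequences return their blocks to the shared pool.
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


if __name__ == "__main__":
    lens = [100, 30, 700]
    print(contiguous_slots(lens, max_len=2048), paged_slots(lens))
```

With three sequences of 100, 30, and 700 tokens and a 2048-token maximum, contiguous preallocation reserves far more slots than the paged layout. The saved memory lets a server batch more concurrent sequences, which is where the throughput gain comes from. A single sequence still needs the full weights plus its own KV blocks, so the floor stays where it was.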