Hacker News
by
wskwon
1099 days ago
Not really. vLLM improves the throughput of serving your LLM, but it does not reduce the minimum amount of resources required to run your model.
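To make the distinction concrete, here is a rough sketch of why block-based (paged) KV-cache allocation helps throughput without lowering the floor. All names and numbers are made up for illustration and this is not vLLM's actual code: it only compares how many token slots get reserved when each request pre-allocates a maximum-length buffer versus when it holds only the fixed-size blocks it has filled.

```python
import math

def contiguous_slots(max_len: int, num_requests: int) -> int:
    """Token slots reserved when every request pre-allocates max_len slots."""
    return max_len * num_requests

def paged_slots(lengths: list[int], block_size: int) -> int:
    """Token slots reserved when each request holds only whole blocks it needs."""
    return sum(math.ceil(n / block_size) * block_size for n in lengths)

# Hypothetical workload: current sequence lengths of four in-flight requests.
lengths = [100, 250, 40, 900]
MAX_LEN = 2048   # assumed per-request maximum context
BLOCK = 16       # assumed block size in tokens

reserved_contig = contiguous_slots(MAX_LEN, len(lengths))
reserved_paged = paged_slots(lengths, BLOCK)

# Model weights are deliberately not counted: they are a fixed cost that
# paging the KV cache does not change.
print(reserved_contig, reserved_paged)
```

The slots freed by paging can be used to batch more requests, which is where the throughput comes from. The weights, plus the cache for at least one request, still have to fit.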
e12e
1099 days ago
But, in theory, could llama.cpp implement a similar approach to paging/memory and see a speedup for 4-bit models on CPU?