

by llmtosser 316 days ago

Distractions like this are probably the reason they still, after more than a year, do not support sharded GGUF.

https://github.com/ollama/ollama/issues/5245

If any of the major inference engines (vLLM, SGLang, llama.cpp) incorporated API-driven model switching, automatic model unload after idle, and automatic CPU layer offloading to avoid OOM, there would be no need for ollama.
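To make the first two of those features concrete, here is a rough sketch of what I mean by model switching plus idle unload. It is not how ollama or any of those engines implements it. `ModelManager`, `loaders`, `idle_timeout_s`, and `reap_idle` are all made-up names for illustration, the "models" are whatever the loader callables return, and CPU layer offloading is not covered at all.

```python
import time
from typing import Callable, Dict, Optional


class ModelManager:
    """Toy model switcher: one model resident at a time, unloaded after idling.

    Hypothetical illustration only; not the API of any real inference engine.
    """

    def __init__(self, loaders: Dict[str, Callable[[], object]],
                 idle_timeout_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self.loaders = loaders              # model name -> function that loads it
        self.idle_timeout_s = idle_timeout_s
        self.clock = clock                  # injectable so the timeout is testable
        self.active_name: Optional[str] = None
        self.active_model: Optional[object] = None
        self.last_used = 0.0

    def get(self, name: str) -> object:
        """Return the requested model, swapping out the resident one if needed."""
        if name not in self.loaders:
            raise KeyError(f"unknown model: {name}")
        if self.active_name != name:
            self.unload()                   # free the old model before loading the new one
            self.active_model = self.loaders[name]()
            self.active_name = name
        self.last_used = self.clock()
        return self.active_model

    def unload(self) -> None:
        """Drop the reference to the resident model, if any."""
        self.active_name = None
        self.active_model = None

    def reap_idle(self) -> bool:
        """Unload the resident model if it has been idle too long.

        Meant to be called periodically (e.g. from a timer thread).
        Returns True if a model was unloaded.
        """
        if self.active_name is None:
            return False
        if self.clock() - self.last_used >= self.idle_timeout_s:
            self.unload()
            return True
        return False
```

An API server would call `get(request_model_name)` on every request and run `reap_idle()` on a timer. The point is that the client only ever names a model and the server handles the loading and unloading.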


jychang 316 days ago

That’s just llama-swap and llama.cpp


llmtosser 316 days ago

Interesting. It does indeed seem that llama-server has the endpoints needed for model swapping, and llama.cpp recently gained a flag for dynamic CPU offload as well.

However, the approach to model swapping is not 'ollama compatible', which means that, as best I can tell, all the OSS tools supporting 'ollama' (e.g. Open WebUI, OpenHands, bolt.diy, n8n, Flowise, browser-use, etc.) can't take advantage of this particularly useful capability.
