by llmtosser
316 days ago
Distractions like this are probably the reason they still, after more than a year, do not support sharded GGUF. https://github.com/ollama/ollama/issues/5245 If any of the major inference engines - vLLM, SGLang, llama.cpp - incorporated API-driven model switching, automatic model unloading after idle, and automatic CPU layer offloading to avoid OOM, there would be no need for ollama.