by simgt
246 days ago
I read the opposite, that you don't have to be locked into Ollama's registry if you don't want to. Could you share a bit more of what you do with llama.cpp? I'd rather use llama-server, but it seems to require a good amount of fiddling with the parameters to get good performance.
We end up fiddling with other parameters because doing so gives better performance on a particular setup, so it's well worth it. One example is the recent --n-cpu-moe switch, which offloads experts to the CPU while filling all available VRAM, and that can give a 50% boost on models like gpt-oss-120b.
Once you've tried this, going without it is a no-go. Meanwhile, on Ollama there's an open issue asking for this: https://github.com/ollama/ollama/issues/11772
Finally, llama-swap separately provides the auto-loading/unloading feature for multiple models.
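
To make the --n-cpu-moe point concrete, here is a minimal sketch of how such a launch line could be assembled. The model filename, the value 24, and the helper name build_llama_server_cmd are placeholders I made up for illustration; the right --n-cpu-moe value depends on your VRAM and has to be found by trial. The large --n-gpu-layers value is only the usual way of asking for every layer on the GPU.

```python
import shlex


def build_llama_server_cmd(model_path, n_cpu_moe, n_gpu_layers=999, port=8080):
    """Return an argv list for launching llama-server.

    Hypothetical helper for illustration, not part of llama.cpp.
    n_cpu_moe is the number of layers whose MoE expert weights stay on the CPU.
    """
    return [
        "llama-server",
        "-m", model_path,
        "--n-gpu-layers", str(n_gpu_layers),  # request all layers on the GPU
        "--n-cpu-moe", str(n_cpu_moe),        # keep experts of these layers on CPU
        "--port", str(port),
    ]


if __name__ == "__main__":
    # Placeholder model path and expert count; tune n_cpu_moe for your own GPU.
    cmd = build_llama_server_cmd("gpt-oss-120b.gguf", n_cpu_moe=24)
    print(shlex.join(cmd))
```

In practice you would lower the --n-cpu-moe value until the model no longer fits in VRAM, then step back up by one.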