
Recently llama.cpp made a few common parameters the default (-ngl 999, -fa on), so it got simpler: --model, --ctx-size and --jinja are generally enough to start.

We end up fiddling with other parameters because they give better performance on a particular setup, so it's well worth it. One example is the recent --n-cpu-moe switch, which offloads experts to the CPU while filling all available VRAM; it can give a 50% boost on models like gpt-oss-120b.
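
For anyone who wants to see what that looks like on the command line, here is a rough sketch of a llama-server invocation. The model path, context size, and the number passed to --n-cpu-moe are placeholders I made up, not recommended values. The context flag is spelled --ctx-size, which is the llama.cpp spelling as far as I know. The script only prints the command so you can review it first.

```shell
#!/bin/sh
# Sketch only: every value below is a placeholder, not a tuned setting.
MODEL="${MODEL:-/path/to/gpt-oss-120b.gguf}"  # hypothetical model path
CTX="${CTX:-16384}"                           # assumed context size
N_CPU_MOE="${N_CPU_MOE:-24}"                  # assumed count of layers whose experts stay on CPU

# Build the argument list as positional parameters so paths with spaces survive.
set -- llama-server \
  --model "$MODEL" \
  --ctx-size "$CTX" \
  --jinja \
  --n-cpu-moe "$N_CPU_MOE"

# Print the command for review; to actually launch it, run: "$@"
echo "$@"
```

The usual approach is to lower the --n-cpu-moe value step by step until VRAM is nearly full, since each layer whose experts move back onto the GPU should help throughput.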

Once you've had a taste of that, going without it is a no-go. Meanwhile on Ollama there's an open issue asking for this: https://github.com/ollama/ollama/issues/11772

Finally, llama-swap separately provides the auto-loading/unloading feature for multiple models.