| HN Mirror



	by CaptainOfCoit 246 days ago
	> I don’t know how much Ollama contributes to llama.cpp

	If nothing else, Ollama is free publicity for llama.cpp, at least when they acknowledge they're mostly using the work of llama.cpp, which has happened at least once! I found llama.cpp by first finding Ollama, then figured I'd rather avoid the lock-in of Ollama's registry, so I ended up using llama.cpp for everything.


speedgoose 246 days ago

By the way, you can use Hugging Face with Ollama, and local Modelfiles too.


CaptainOfCoit 246 days ago

You're saying that like you cannot do that with llama.cpp? I feel like most Ollama users seem to have no idea what features/benefits directly come from llama.cpp rather than Ollama itself...


simgt 246 days ago

I read it the opposite way: that you don't have to be locked into Ollama's registry if you don't want to be.

Could you share a bit more about what you do with llama.cpp? I'd rather use llama-server, but it seems to require a good amount of fiddling with the parameters to get good performance.


mtone 245 days ago

Recently llama.cpp made a few common parameters default (-ngl 999, -fa on), so it got simpler: --model, --ctx-size and --jinja are generally enough to get started.

We end up fiddling with other parameters because doing so gives better performance on a particular setup, so it's well worth it. One example is the recent --n-cpu-moe switch, which offloads experts to the CPU while filling all available VRAM and can give a 50% boost on models like gpt-oss-120b.

After tasting this, not using it is a no-go. Meanwhile on Ollama there's an open issue asking for this: https://github.com/ollama/ollama/issues/11772

Finally, llama-swap separately provides the auto-loading/unloading feature for multiple models.
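For anyone who wants to script the flags discussed above, here is a minimal sketch of a launcher helper. It only assembles the argument list and does not start anything. Treat the details as assumptions: `build_llama_server_cmd` is a hypothetical helper name, the model path and the numeric values are placeholders, and `--ctx-size` is used as the context-size flag of `llama-server`. Check `llama-server --help` on your build before relying on any of it.

```python
import shlex
from typing import List, Optional


def build_llama_server_cmd(
    model_path: str,
    ctx_size: int = 8192,
    n_cpu_moe: Optional[int] = None,
    binary: str = "llama-server",
) -> List[str]:
    """Assemble an argument list for llama-server without launching it.

    Hypothetical helper: the defaults here are illustrative, not recommendations.
    """
    if ctx_size <= 0:
        raise ValueError("ctx_size must be a positive integer")

    # The three flags the comment describes as generally enough to get started.
    cmd = [binary, "--model", model_path, "--ctx-size", str(ctx_size), "--jinja"]

    # Optional: keep the expert weights of the first N MoE layers on the CPU.
    if n_cpu_moe is not None:
        if n_cpu_moe < 0:
            raise ValueError("n_cpu_moe must be non-negative")
        cmd += ["--n-cpu-moe", str(n_cpu_moe)]

    return cmd


if __name__ == "__main__":
    # Placeholder path and values; tune n_cpu_moe to your own VRAM budget.
    cmd = build_llama_server_cmd("models/gpt-oss-120b.gguf", ctx_size=16384, n_cpu_moe=24)
    print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd)` as-is. Building a list rather than a single string avoids shell-quoting problems with model paths that contain spaces.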


monkmartinez 245 days ago

Nailed it. To make matters worse, Ollama obfuscates the models, so its users don't really know what they are running until they dig into the model file. Only then can they see that what they thought was Deepseek-r1 is actually an 8B Qwen distillation of Deepseek-r1, for example.

Luckily, we have Jan.ai and LM Studio, which are happy to run GGUF models at full tilt on various hardware configs. Added bonus: both include a very nice API server as well.
