> I don’t know how much Ollama contributes to llama.cpp
If nothing else, Ollama is free publicity for llama.cpp, at least when they acknowledge they're mostly using the work of llama.cpp, which has happened at least once! I found llama.cpp by first finding Ollama and then figured I'd rather avoid the lock-in of Ollama's registry, so ended up using llama.cpp for everything.
You're saying that like you cannot do that with llama.cpp? I feel like most Ollama users seem to have no idea what features/benefits directly come from llama.cpp rather than Ollama itself...
I read the opposite, that you don't have to be locked into Ollama's registry if you don't want to.
Could you share a bit more about what you do with llama.cpp? I'd rather use llama-server, but it seems to require a good amount of fiddling with the parameters to get good performance.
Recently llama.cpp made a few common parameters default (-ngl 999, -fa on), so it got simpler: --model, --ctx-size, and --jinja generally do it to start.
We end up fiddling with other parameters because doing so gives better performance on a particular setup, so it's well worth it. One example is the recent --n-cpu-moe switch, which offloads experts to the CPU while filling all available VRAM and can give a 50% boost on models like gpt-oss-120b.
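For anyone who wants to script the launch, here is a minimal sketch of assembling that command line in Python. `build_llama_server_cmd` is a hypothetical helper (not part of llama.cpp), the model path and the values for `--ctx-size` and `--n-cpu-moe` are placeholders, and the right numbers depend entirely on your hardware and model.

```python
import shlex


def build_llama_server_cmd(model_path, ctx_size=8192, n_cpu_moe=None):
    """Build an argv list for llama-server.

    Hypothetical helper for illustration; only the flags named in the
    thread are used (--model, --ctx-size, --jinja, --n-cpu-moe).
    """
    cmd = [
        "llama-server",
        "--model", model_path,
        "--ctx-size", str(ctx_size),
        "--jinja",
    ]
    # Only add the MoE offload switch when a value was requested.
    if n_cpu_moe is not None:
        cmd += ["--n-cpu-moe", str(n_cpu_moe)]
    return cmd


# Placeholder path and values; tune them for your own setup.
cmd = build_llama_server_cmd("models/gpt-oss-120b.gguf", ctx_size=16384, n_cpu_moe=24)
print(shlex.join(cmd))
```

The resulting list can be passed straight to `subprocess.run(cmd)`. Keeping it as a list rather than a single string avoids shell quoting problems with paths that contain spaces.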
Nailed it. To make matters worse, Ollama obfuscates the models, so its users don't really know what they are running until they dig into the model file. Only then can they see that what they thought was Deepseek-r1 is actually an 8B qwen distillation of Deepseek-r1, for example.
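If you have the GGUF file itself on disk, one way to check what it really contains is to read the metadata at the start of the file. Below is a rough sketch, assuming the GGUF v2/v3 layout (magic `GGUF`, a uint32 version, uint64 tensor and key/value counts, then the key/value pairs). `read_gguf_metadata` is a hypothetical helper written for this comment, not an Ollama or llama.cpp API, and it does no validation beyond the magic bytes.

```python
import struct

# GGUF scalar value types mapped to little-endian struct formats.
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    """Read one fixed-size value from the stream."""
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise ValueError("unexpected end of file")
    return struct.unpack(fmt, data)[0]


def _read_string(f):
    """GGUF strings are a uint64 length followed by UTF-8 bytes."""
    length = _read(f, "<Q")
    return f.read(length).decode("utf-8", errors="replace")


def _read_value(f, vtype):
    if vtype in _SCALARS:
        return _read(f, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        # Arrays store the element type, a count, then the elements.
        elem_type = _read(f, "<I")
        count = _read(f, "<Q")
        return [_read_value(f, elem_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def read_gguf_metadata(f):
    """Return (version, metadata dict) from a binary stream (assumes GGUF v2+)."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    version = _read(f, "<I")
    _tensor_count = _read(f, "<Q")  # read to advance the stream; not used here
    kv_count = _read(f, "<Q")
    meta = {}
    for _ in range(kv_count):
        key = _read_string(f)
        vtype = _read(f, "<I")
        meta[key] = _read_value(f, vtype)
    return version, meta


# Usage (the path is a placeholder):
#   with open("model.gguf", "rb") as f:
#       version, meta = read_gguf_metadata(f)
#   print(meta.get("general.architecture"), meta.get("general.name"))
```

Keys such as `general.architecture` and `general.name` are the ones to look at first; tokenizer arrays can be large, so expect this to take a moment on real models.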
Luckily, we have Jan.ai and LM Studio, which are happy to run GGUF models at full tilt on various hardware configs. Added bonus: both include a very nice API server as well.
I mean, they have attributed it, but it's also open source software. I guess the more meaningful question is: why didn't ggerganov build Ollama if it was that easy? Or what is his company working on now?
I cannot answer for GG, but the early days of llama.cpp were crazy and everything was so very hacky. Remember, Textgen-webui was 'the way' to run models at first because it supported so many different quant types and file extensions. At the time, most people were using multiple different quantization methods, and it was really hard to figure out objectively which were performing better or worse.
GGUF/GGML was like the 4th iteration of file type quantization from llama.cpp and I remember that I had to consciously begin watching the bandwidth usage from my ISP. Up to that point, I had never received an email warning me about reaching limits of my 2TB connection. All for the same models just in different forms. TheBloke was pumping out models like he had unlimited time/effort.
I say all that to say, llama.cpp was still trying, dare I say 'inventing', all the things throughout these transitions. Ollama came in to make the running part easier and less dependent on CLI flags, building on llama.cpp. Awesome.
GG and company are down in the trenches of the model architectures with CUDA, Vulkan, CPU, ROCm, etc. They are working on perplexity and token processing/generation; just look at the 'bin' folder when you compile the project. There are so many different aspects that make the whole thing work as well as it does. It's amazing that we have llama-server at all, given the amount of work that has gone into making llama.cpp.
All that to say, Ollama shit the bed on attribution. They were called out on r/localllama very early on for not really giving credit to llama.cpp. They have a soiled reputation with the people who participate in that subreddit, at least. If I remember correctly, they were also called out for not contributing back, which further stained their reputation among the folks who hang out there.
So it's not a matter of "ease" to build what Ollama built... At least from the perspective of someone who has been paying close attention on r/localllama, the problem was/is simply the perception (right or wrong) captured by the meme: Person 2 to Person 1: "You built this?" -> Person 2 takes the item/thing -> Person 2 holds up the item/thing -> "I built this". A simple act that really pissed off the community in general.