Hacker News new | ask | show | jobs
by antirez 470 days ago
Ok, this explains why QwQ is working great on their chat. Btw I saw this thing multiple times: that ollama inference, for one reason or the other, even without quantization, somewhat had issues with the actual model performance. In one instance the same model with the same quantization level, if run with MLX was great, and I got terrible results with ollama: the point here is not ollama itself, but there is no testing at all for this models.

I believe that models should be released with test vectors at t=0, providing what is the expected output for a given prompt for the full precision and at different quantization levels. And also for specific prompts, the full output logits for a few tokens, so that it's possible to also compute the error due to quantization or inference errors.

2 comments

Yeah the state of the art is pretty awful. There have been multiple incidents where a model has been dropped on ollama with the wrong chat template, resulting in it seeming to work but with greatly degraded performance. And I think it's always been a user that notices, not the ollama team or the model team.
I'm grateful for anyone's contributions to anything, but I kinda shake my head about ollama. the reason stuff like this happens is they're doing the absolute minimal job necessary, to get the latest model running, not working.

I make a llama.cpp wrapper myself, and it's somewhat frustrating putting effort in for everything from big obvious UX things, like error'ing when the context is too small for your input instead of just making you think the model is crap, to long-haul engineering commitments, like integrating new models with llama.cpp's new tool calling infra, and testing them to make sure it, well, actually works.

I keep telling myself that this sort of effort pays off a year or two down the road, once all that differentiation in effort day-to-day adds up. I hope :/

Can you link your wrapper? I've read and run up against a lot of footguns related to Ollama myself and I think surfacing community efforts to do better would be quite useful.
Cheers, thanks for your interest:

Telosnex, @ telosnex.com --- fwiw, general positioning is around paid AIs, but there's a labor-of-love llama.cpp backed on device LLM integration that makes them true peers, both in UI and functionality. albeit with a warning sign because normie testers all too often wander into trying it on their phone and killing their battery.

My curse is the standard engineer one - only place I really mention it is one-off in comments like here to provide some authority on a point I want to make...I'm always one release away from it being perfect enough to talk up regularly.

I really really need to snap myself awake and ban myself from the IDE for a month.

But this next release is a BFD, full agentic coding, with tons of tools baked in, and I'm so damn proud to see the extra month I've spent getting llama.cpp tools working agentically too. (https://x.com/jpohhhh/status/1897717300330926109, real thanks is due to @ochafik at Google, he spent a very long term making a lot of haphazard stuff in llama.cpp coalesce. also phi-4 mini. this is the first local LLM that is reasonably fast and an actual drop-in replacement for RAG and tools, after my llama.cpp patch)

Please, feel free to reach out if you try it and have any thoughts, positive or negative. james @ the app name.com

The test vectors idea is pretty interesting! That's a good one.

I haven't been able to try out QwQ locally yet. There seems to be something wrong with this model on Ollama / my MacBook Pro. The text generation speed is glacial (much, much slower than, say Qwen 72B at the same quant). I also don't see any MLX versions on LM Studio yet.