by antirez
470 days ago
Ok, this explains why QwQ is working great on their chat. Btw, I've seen this multiple times: ollama inference, for one reason or another, even without quantization, had issues reproducing the actual model performance. In one instance the same model at the same quantization level was great when run with MLX, while I got terrible results with ollama. The point here is not ollama itself, but that there is no testing at all for these models. I believe models should be released with test vectors at t=0, giving the expected output for a given prompt at full precision and at different quantization levels. And also, for specific prompts, the full output logits for a few tokens, so that it's possible to compute the error due to quantization or inference errors.
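To make the logits idea concrete, here is a rough sketch of what checking one token position against a published test vector could look like. Everything in it is my own assumption, not an existing format or tool: the function names (`log_softmax`, `compare_logits`), the choice of metrics (max absolute error, KL divergence, greedy token agreement), and the toy three-entry "vocabulary" are all hypothetical.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax: subtract the max before exponentiating.
    m = max(logits)
    log_total = math.log(sum(math.exp(x - m) for x in logits))
    return [x - m - log_total for x in logits]

def compare_logits(reference, candidate):
    """Compare the logits of one token position.

    reference: logits from the published test vector (hypothetical format)
    candidate: logits produced by the inference engine under test
    """
    if len(reference) != len(candidate):
        raise ValueError("vocabulary sizes differ")
    # Largest per-entry deviation in raw logit space.
    max_abs_err = max(abs(r - c) for r, c in zip(reference, candidate))
    # KL(reference || candidate) over the next-token distributions.
    lp = log_softmax(reference)
    lq = log_softmax(candidate)
    kl = sum(math.exp(p) * (p - q) for p, q in zip(lp, lq))
    # At t=0 only the argmax matters, so also check the greedy token.
    same_greedy = reference.index(max(reference)) == candidate.index(max(candidate))
    return {"max_abs_err": max_abs_err, "kl": kl, "same_greedy_token": same_greedy}

# Toy values, made up for illustration only.
reference = [2.0, 1.0, 0.1]
slightly_off = [2.1, 0.9, 0.1]   # e.g. small quantization noise
broken = [1.0, 2.0, 0.1]         # e.g. an inference bug flipping the top token

print(compare_logits(reference, slightly_off))
print(compare_logits(reference, broken))
```

With something like this, a release could state an acceptable KL bound per quantization level, and any engine whose output exceeds it (or picks a different greedy token) would be flagged before users blame the model.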