by dust42
123 days ago
I get about 320 tok/s prompt processing (PP) and 42 tok/s token generation (TG) with a 4-bit quant on MLX. llama.cpp was about half that speed for this model, but afaik it improved a few days ago; I haven't tested it yet. I have tried many tools locally and was never really happy with any of them. I finally tried Qwen Code CLI, assuming it would run well with a Qwen model, and it does. YMMV, I mostly do JavaScript and Python. The most important setting was the max context size; it then auto-compacts before reaching that limit. I run with 65536 but may raise this a bit. Last but not least, OpenCode is VC funded, so at some point they will have to make money, while Gemini CLI / Qwen CLI are not the primary products of their companies but are definitely dogfooded.
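To put those numbers in perspective, here is a rough back-of-the-envelope sketch in Python. It only uses the figures quoted above (320 tok/s PP, 42 tok/s TG, 65536 max context). The function names and the 0.9 compaction threshold are hypothetical, made up for illustration; this is not how Qwen Code CLI actually decides when to compact.

```python
# Rough estimates only; the speeds are the ones quoted in the comment above.
PP_TOK_PER_S = 320.0   # prompt processing speed
TG_TOK_PER_S = 42.0    # token generation speed
MAX_CONTEXT = 65536    # max context size setting


def prefill_seconds(prompt_tokens: int, pp_speed: float = PP_TOK_PER_S) -> float:
    """Time to process a prompt of the given length from scratch."""
    return prompt_tokens / pp_speed


def generation_seconds(output_tokens: int, tg_speed: float = TG_TOK_PER_S) -> float:
    """Time to generate the given number of output tokens."""
    return output_tokens / tg_speed


def should_compact(used_tokens: int,
                   max_context: int = MAX_CONTEXT,
                   threshold: float = 0.9) -> bool:
    """Hypothetical rule: compact once usage crosses a fraction of the limit.

    The 0.9 threshold is an assumption for illustration, not a documented
    default of any tool.
    """
    return used_tokens >= int(max_context * threshold)


if __name__ == "__main__":
    print(f"Full-context prefill: {prefill_seconds(MAX_CONTEXT):.1f} s")
    print(f"1000-token reply:     {generation_seconds(1000):.1f} s")
    print(f"Compact at 60000?     {should_compact(60000)}")
```

So a cold prefill of a completely full 65536-token context would take on the order of a few minutes at 320 tok/s, which is one reason to think twice before raising the limit much further.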
|
Btw, I also get 42-60 tps on an M4 Max with the MLX 4-bit quants hosted by LM Studio. Which software do you use to run it?