| HN Mirror


by dust42 123 days ago

320 tok/s PP and 42 tok/s TG with a 4-bit quant and MLX. llama.cpp was about half that speed for this model, but afaik it improved a few days ago; I haven't tested it yet though.

I have tried many tools locally and was never really happy with any of them. I finally tried Qwen Code CLI, assuming it would run well with a Qwen model, and it does. YMMV; I mostly do JavaScript and Python. The most important setting was the max context size: the CLI then auto-compacts before reaching it. I run with 65536 but may raise this a bit.

Last but not least, OpenCode is VC-funded, so at some point they will have to make money, while Gemini CLI / Qwen CLI are not the primary products of their companies but are definitely dogfooded.

1 comments

ttoinou 121 days ago

Works for me, but sometimes there's an issue with the tool template from Qwen: past chat turns get changed, so the KV cache gets invalidated and it needs to reprocess the input tokens from scratch. It doesn't happen all the time though.
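To illustrate why a changed past turn is so costly, here is a toy sketch. It assumes the server can only reuse the cached KV entries for the longest token prefix shared with the new prompt; real servers may be smarter than this. The function names and token ids are made up for illustration.

```python
def common_prefix_len(cached, new):
    """Count the leading tokens that are identical in both sequences."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


def tokens_to_reprocess(cached, new):
    """Tokens that must go through prompt processing again (toy model)."""
    return len(new) - common_prefix_len(cached, new)


# Hypothetical token ids standing in for a 1000-token chat history.
cached = list(range(1000))

# Normal turn: the history is untouched and two new tokens are appended.
appended = cached + [2000, 2001]

# Template re-render: the very first token of the history differs.
rewritten = [999] + cached[1:] + [2000, 2001]

append_cost = tokens_to_reprocess(cached, appended)
rewrite_cost = tokens_to_reprocess(cached, rewritten)

print(append_cost, rewrite_cost)
```

In this toy model, appending costs only the new tokens, while a change near the start of the history forces the whole prompt to be processed again.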

Btw I also get 42-60 tps on an M4 Max with the MLX 4-bit quants hosted by LM Studio. Which software do you use to run it?


dust42 120 days ago

I use the MLX server directly from the MLX community project (by Apple). 42 tps is with a 0-5000 token context. It starts to drop from there; I have never seen 60.

Yesterday I tested the latest llama.cpp, and PP has made a huge jump to 420 tps, which is about 30% faster than MLX on my M1. TG is now 25 tps, which is below MLX but does not degrade much: at 50k context it is still 22-23 tps.
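As a rough back-of-the-envelope check of these numbers, the sketch below computes the relative PP gain and how long a full reprocess of a 65536-token context would take at each rate. It assumes the PP rate stays constant over the whole context, which is optimistic; the variable names are only illustrative.

```python
# Figures quoted in this thread (tokens per second).
MLX_PP = 320.0
LLAMA_PP = 420.0
MLX_TG = 42.0
LLAMA_TG = 25.0
CONTEXT_TOKENS = 65536  # max context size mentioned above


def speedup_pct(new_rate, old_rate):
    """Relative speed change of new_rate over old_rate, in percent."""
    return (new_rate / old_rate - 1.0) * 100.0


def prefill_seconds(tokens, pp_rate):
    """Time to process `tokens` prompt tokens, assuming a constant rate."""
    return tokens / pp_rate


pp_gain = speedup_pct(LLAMA_PP, MLX_PP)    # 31.25
tg_change = speedup_pct(LLAMA_TG, MLX_TG)  # negative: llama.cpp TG is slower
mlx_prefill = prefill_seconds(CONTEXT_TOKENS, MLX_PP)      # 204.8
llama_prefill = prefill_seconds(CONTEXT_TOKENS, LLAMA_PP)

print(f"PP gain: {pp_gain:.1f}%, TG change: {tg_change:.1f}%")
print(f"Full reprocess: MLX {mlx_prefill:.0f}s, llama.cpp {llama_prefill:.0f}s")
```

Under that assumption, a full reprocess of the context takes a few minutes either way, which is why avoiding KV cache invalidation matters more than the raw PP difference.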

With Qwen Code CLI, llama.cpp re-processes the full KV cache a lot less often. So for now I am switching back to llama.cpp.

It is worth spending some time on the settings. I am really annoyed by the silly jokes (was it Claude that started this?). You can disable them with customWittyPhrases. Also, setting contextWindowSize will make the CLI auto-compress, which works really well for me.

And depending on what you do, maybe set privacy.usageStatisticsEnabled to false.

Like Gemini CLI, Qwen CLI supports OpenTelemetry. When I have time I'll have a look at why the KV cache gets invalidated.


ttoinou 120 days ago

Great, thanks! I am so annoyed by one specific phrase, "launching wit.exe". It's not funny when it could actually be talking about real software running on your machine.
