

by sleepyeldrazi 53 days ago

I have been testing and using Qwen3.6 27B (running on my 3090) since it dropped, and I genuinely think this is the first model that runs on consumer hardware and can actually replace frontier models for a lot of workloads.

I ran 8 tests on a variety of open-weights models plus Opus 4.7 (the 1M-context version), and the little dense model came in right behind Opus: https://github.com/sleepyeldrazi/llm_programming_tests/tree/... Of note: Opus was the only model to push back against the spec on the hardest challenge, saying 'that's not possible', even though the spec links to examples of it being done.

There may be problems with the MLX versions, as I haven't had any looping in all the testing I've done, which covers all my agentic and coding work over the last couple of days (since it dropped). I have had tool_call misses 4 or 5 times so far, which isn't ideal, but no looping. I first used it in pi-mono and later, once I realized it's a serious model, switched to opencode.

My setup is llama.cpp running on a 3090 in WSL, using the unsloth IQ4_NL quant with these flags:

        --ctx-size 128000 \
        --jinja \
        --temp 0.6 \
        --top-p 0.95 \
        --top-k 20 \
        --min-p 0.0 \
        --repeat-penalty 1.0 \
        --presence-penalty 0.0 \
        --threads 12 \
        --gpu-layers 99 \
        --no-warmup \
        --no-mmap \
        -fa on

3 comments

fy20 53 days ago

Running it on a MacBook Pro M5 48GB:

        -hf unsloth/Qwen3.6-27B-GGUF:UD-Q6_K_XL \
        -c 128000 \
        --parallel 1 \
        --flash-attn on \
        --no-context-shift \
        --cache-type-k q8_0 \
        --cache-type-v q8_0 \
        --temp 0.6 \
        --top-p 0.95 \
        --top-k 20 \
        --min-p 0.0 \
        --presence-penalty 0.0 \
        --reasoning on \
        --jinja \
        --chat-template-kwargs "{\"preserve_thinking\": true}" \
        --spec-type ngram-simple \
        --draft-max 64 \
        --timeout 1800

Does anyone know any tips for optimising prompt processing, since that's the slowest part? It takes a few minutes before OpenCode, with ~20k tokens of initial context, first responds, but subsequent responses are pretty fast thanks to caching.
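To put a rough number on that delay, here is a minimal sketch that turns "~20k tokens in a few minutes" into an implied prefill speed. The 180-second figure is only an assumed reading of "a few minutes", and the function name is made up for illustration.

```python
def implied_prefill_rate(prompt_tokens: int, seconds_to_first_response: float) -> float:
    """Average prompt-processing speed in tokens per second."""
    if seconds_to_first_response <= 0:
        raise ValueError("seconds_to_first_response must be positive")
    return prompt_tokens / seconds_to_first_response


# ~20k tokens comes from the comment above; 180 s is an assumption, not a measurement.
rate = implied_prefill_rate(20_000, 180.0)
print(f"implied prefill speed: {rate:.0f} tokens/s")
```

Swapping in the real wall-clock time gives a baseline to compare against after changing any flags.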

jonaustin 53 days ago

https://github.com/jundot/omlx

note: 27b is going to be slow; use the 35b MoE if you want decent token/sec speed.

dexterlagan 52 days ago

Many of us tested 27B and 35B side by side, and the dense model is significantly smarter. It is indeed slower, but 35B makes a lot of mistakes that 27B doesn't.

sleepyeldrazi 53 days ago

Honestly, I haven't dug around to figure out if there's a hardware reason for it, but prompt processing has always been a lot slower for me on Macs in general. I mostly use MLX on my 24GB M4 Pro though, so I will pull llama.cpp on it as well to see what the prefill is like.

I've gotten around 16 t/s generation with 4-bit and mxfp4 on that model. The 3090 I mentioned has a little over 900 GB/s of memory bandwidth, while those Macs are, I think, around 270 GB/s. If my understanding is correct, Macs do utilize the bandwidth better in this case, but it still doesn't make up the difference (on the 3090 it's around 30-35 t/s, depending on the size of the context).
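A back-of-the-envelope way to check the bandwidth point: for a dense model, each generated token has to read roughly all the weights once, so memory bandwidth divided by model size gives a rough ceiling on tokens per second. The sketch below uses the bandwidth and t/s figures from this comment (taking the 16 t/s as the M4 Pro number). The ~4.5 bits per weight is an assumption for a 4-bit-class quant, and KV cache reads are ignored, so treat the output as an estimate only.

```python
def decode_ceiling_tps(bandwidth_gb_s: float, model_gb: float) -> float:
    """Rough upper bound on generation speed if every token reads all weights once."""
    return bandwidth_gb_s / model_gb


PARAMS = 27e9           # dense 27B model
BITS_PER_WEIGHT = 4.5   # assumption: effective size of a 4-bit-class quant
model_gb = PARAMS * BITS_PER_WEIGHT / 8 / 1e9

# (bandwidth in GB/s, observed generation t/s), both taken from the comment;
# 32.5 is the midpoint of the quoted 30-35 t/s range.
machines = {
    "3090": (900.0, 32.5),
    "M4 Pro": (270.0, 16.0),
}

results = {}
for name, (bandwidth, observed) in machines.items():
    ceiling = decode_ceiling_tps(bandwidth, model_gb)
    utilization = observed / ceiling
    results[name] = (ceiling, utilization)
    print(f"{name}: ceiling ~{ceiling:.0f} t/s, observed {observed} t/s, "
          f"~{utilization:.0%} of ceiling")
```

Under these assumptions the Mac lands near 90% of its ceiling and the 3090 near 55%, which lines up with the Mac using its bandwidth better while still being slower in absolute terms.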

Also, do run a quick experiment removing the cache quants (the --cache-type-k and --cache-type-v flags) if you want to tinker with it a bit more; iirc KV quant does add a small overhead during prefill.

I would be very interested to know your prefill and generation numbers.
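For anyone wanting to report comparable numbers, here is a small sketch that converts token counts and timings into prefill and generation speeds. It assumes the server response includes a timings object with prompt_n, prompt_ms, predicted_n and predicted_ms fields, which I believe is what llama.cpp's llama-server returns, but check your build. The sample values are placeholders, not measurements.

```python
def summarize_timings(timings: dict) -> dict:
    """Convert token counts and millisecond timings into tokens per second."""
    prompt_seconds = timings["prompt_ms"] / 1000.0
    generation_seconds = timings["predicted_ms"] / 1000.0
    return {
        "prefill_tps": timings["prompt_n"] / prompt_seconds,
        "generation_tps": timings["predicted_n"] / generation_seconds,
    }


# Placeholder values for illustration only; field names are an assumption.
sample = {
    "prompt_n": 20000,
    "prompt_ms": 180000.0,
    "predicted_n": 256,
    "predicted_ms": 16000.0,
}
summary = summarize_timings(sample)
print(f"prefill: {summary['prefill_tps']:.1f} t/s, "
      f"generation: {summary['generation_tps']:.1f} t/s")
```

Posting both numbers together with the context length makes results from different machines easier to compare.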

fmajid 51 days ago

Combine that with full ZIMs of Wikipedia and Stack Overflow, plus documentation of your languages of choice, and you should be golden. I have 4TB SSDs in almost all my laptops (except Macs due to Tim Apple's price-gouging, but I am transitioning away from macOS), and I sync my entire eBook library as well so I am fully covered on the reference manual front.

cadamsdotcom 53 days ago

> I have been testing

With local models which are often benchmaxxed, testing unfortunately isn’t as predictive as you’d like.

sleepyeldrazi 52 days ago

I specifically tested on tasks I designed myself because I know every modern model, not only local ones, is benchmaxxed. The common benchmarks most labs use are (very likely) in their datasets to a degree (I'm assuming unintentionally, but it is still highly probable), and there was a recent report from people at UC Berkeley on how easy it is to actually cheat them: https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/

That is precisely why my testing has been daily driving the model for everything, plus 8 tasks in a domain I care about. Could there be something very similar in their datasets? Of course, at least for most of the tasks, but if that led to the good performance and results I'm getting, I am personally OK with that. I don't care how high the numbers are on the common benchmarks, only whether it works well enough for me.

And if this model doesn't work for you, that's perfectly ok. Everyone has different needs from models. I was just impressed that it did for me, as it was a first from a local model.
