| HN Mirror


	by jbellis 58 days ago
	For coding, qwen 3.6 35b a3b solved 11/98 of the Power Ranking tasks (best-of-two), compared to 10/98 for the same size qwen 3.5. So it's at best very slightly improved and not at all in the class of qwen 3.5 27b dense (26 solved) let alone opus (95/98 solved, for 4.6).
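
To put those counts on a common scale, here is a rough sketch that turns them into solve rates with a simple binomial standard error. Treating the 98 tasks as independent trials is an assumption (best-of-two scoring and correlated tasks both bend it), and the function names are only illustrative.

```python
import math

TOTAL_TASKS = 98

# Solved counts as quoted in the comment above.
solved = {
    "qwen 3.6 35b a3b": 11,
    "qwen 3.5 (same size)": 10,
    "qwen 3.5 27b dense": 26,
    "opus 4.6": 95,
}

def solve_rate(k, n=TOTAL_TASKS):
    """Fraction of tasks solved."""
    return k / n

def std_error(k, n=TOTAL_TASKS):
    """Binomial standard error of the solve rate (assumes independent tasks)."""
    p = k / n
    return math.sqrt(p * (1 - p) / n)

for name, k in solved.items():
    print(f"{name:22s} {solve_rate(k):6.1%} +/- {std_error(k):.1%}")
```

Under that assumption, the one-task gap between 3.6 and 3.5 is about one percentage point, smaller than the roughly three-point standard error on either score, which fits the "at best very slightly improved" reading.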

3 comments

kristianp 58 days ago

This has similar problems to swe bench in that models are likely trained on the same open source projects that the benchmark uses.

https://blog.brokk.ai/introducing-the-brokk-power-ranking/

yorwba 58 days ago

If all models are trained on the benchmark data, you cannot extrapolate the benchmark scores to performance on unseen data, but the ranking of different models still tells you something. A model that solves 95/98 benchmark problems may turn out much worse than that in real life, but probably not much worse than the one that only solved 11/98 despite training on the benchmark problems.

This doesn't hold if some models trained on the benchmark and some didn't, but you can fix this by deliberately fine-tuning all models for the benchmark before comparing them. For more in-depth discussion of this, see https://mlbenchmarks.org/11-evaluating-language-models.html#...

spwa4 58 days ago

It is much faster though. On my m1 max, describing a picture (quick way to get a pretty large context):

Qwen 3.6 35b a3b: 34 tok/sec

Qwen 3.5 27b: 10 tok/sec

Qwen 3.5 35b a3b: doesn't support image input
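
As a back-of-the-envelope sketch, the two throughput figures above work out as follows. The 500-token reply length is a made-up number for illustration, not something measured in the comment.

```python
# Generation speeds as reported above (tokens per second on an M1 Max).
tok_per_sec = {
    "Qwen 3.6 35b a3b": 34,
    "Qwen 3.5 27b": 10,
}

def seconds_for(tokens, rate):
    """Time to generate `tokens` tokens at `rate` tokens per second."""
    return tokens / rate

speedup = tok_per_sec["Qwen 3.6 35b a3b"] / tok_per_sec["Qwen 3.5 27b"]

REPLY_TOKENS = 500  # hypothetical reply length
for name, rate in tok_per_sec.items():
    print(f"{name}: {seconds_for(REPLY_TOKENS, rate):.1f} s for {REPLY_TOKENS} tokens")
print(f"speedup: {speedup:.1f}x")
```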

upboundspiral 58 days ago

I've been using Qwen 3.5 35B-A3B with images as input so I suspect you perhaps didn't include the vision part of the model during testing (I use llama.cpp and I learned I needed to include the separate mmproj part).
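
For anyone hitting the same thing, a minimal sketch of what passing the projector looks like with llama.cpp's multimodal CLI. The file names are placeholders, not real release artifacts, and the snippet only builds and prints the command rather than running it.

```python
import shlex

def mtmd_command(model, mmproj, image, prompt):
    """Build an argv list for llama.cpp's llama-mtmd-cli with a vision projector."""
    return [
        "llama-mtmd-cli",
        "-m", model,          # main model weights (GGUF)
        "--mmproj", mmproj,   # separate multimodal projector file
        "--image", image,
        "-p", prompt,
    ]

# Placeholder paths; substitute the GGUF files you actually downloaded.
cmd = mtmd_command("model.gguf", "mmproj.gguf", "picture.jpg", "Describe this picture.")
print(shlex.join(cmd))
```

Without the `--mmproj` file the model loads as text-only, which would look like "doesn't support image input".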

m-emre 56 days ago

What is the quantization level of your Qwen 3.6 3b model?

__natty__ 58 days ago

You're comparing a tiny model for local inference against a proprietary, expensive frontier model. It would be fairer to compare against a similarly priced model or tiny frontier models like haiku, flash or gpt nano.

javawizard 58 days ago

Not when the article they're commenting on was doing literally exactly the same thing.

ericd 58 days ago

Eh it’s important perspective, lest someone start thinking they can drop $5k on a laptop and be free of Anthropic/OpenAI. Expensive lesson.
