| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by mdasen 56 days ago

Yes, it's a "smaller" (137B) model that competes with Haiku, but it's basically the performance of Qwen3.6-35B-A3B which is 75% smaller and 98% smaller in terms of active parameters (since it's a mixture of experts model). Microsoft should be comparing its model to good smaller models, not Haiku 4.5.

Qwen-3.6-27b is closer to Claude Opus 4.7 than it is to Haiku 4.5 in a lot of benchmarks - and it's way smaller than Microsoft's new model.

Sure, it competes with Haiku, but it shows how far Microsoft is behind lots of other small models that are available.

2 comments

stingraycharles 56 days ago

I understand what you’re saying, but I am generally very careful when comparing models and their benchmarks; benchmarks often don’t really match “real world” quality.

link

yorwba 56 days ago

The technical report https://microsoft.ai/wp-content/uploads/2026/06/main_2026060... has a lot of detail about decontaminating their training data and developing new in-house benchmarks to ensure reliable evaluation. If other models were just overfit to public benchmarks while Microsoft produced something that generalizes better to unseen data, they could've used those in-house benchmarks to argue that point.

Instead, they only do cherry-picked comparisons against Anthropic's small models, and not the full spectrum of competitors.

Without evidence to the contrary, I'll interpret this as just what happens when you're late to the party and insist on doing everything from scratch.

Maybe coaxing reasoning behavior out of their base model without kickstarting it by distilling from existing models provided them with valuable experience that will help improve their future models, or maybe it was an unnecessary waste of time.

link

fmajid 56 days ago

If their model was trained purely on properly licensed data, the reduced legal liability could be a selling point

link

IanCal 56 days ago

> 98% smaller in terms of active parameters (since it's a mixture of experts model).

I don’t think that’s right, this flash model is 5B active params. Qwen3.6-35B-A3B is 3B so 40% smaller.

link