
by camelmel 59 days ago

Huh, according to that model card this is a 137B total parameter model.

Performance doesn't seem that good:

- MAI-Code-1-Flash (137B-A5B) = 51% on SWE-bench pro

- Qwen3.6-35B-A3B = 49.5% on SWE-bench pro (https://huggingface.co/Qwen/Qwen3.6-35B-A3B)

They benchmark against Claude Haiku but Haiku is not good, it's worse than tiny open models you can run locally or via API at 10% the cost.

6 comments

davecitron 59 days ago

Dave Citron here, from the MAI team. Thanks for the feedback, we're getting the model card updated to call out 5B active parameters (137B total).

On benchmarks: in the same VS Code harness, MAI-Code-1-Flash scored 51.2% on SWE-bench Pro vs. Haiku's 35.2% which we see as a pretty big leap. But going forward, we'll include additional models in our benchmarks, including models like Qwen 3.6 and Gemma 4.

easygenes 59 days ago

Have you run it through DeepSWE? I understand that's probably a high ask for this class of model, but would be interesting to see regardless.

Even if it can't fully pass much, there are so many tests against most of the scenarios that you can get a fairly rich report beyond the pass@1 stat. See e.g. this DeepSWE report against the Minimax M3 model: https://entrpi.github.io/misc/deep-swe-minimax-m3/

kosolam 58 days ago

Hey Dave, I’d love to add your new model in the harness I’m going to opensource very soonish. Going to publish benchmarks on real world tasks.

dabockster 58 days ago

Qwen HAS to be a part of the discussion here, even though Microsoft is a US-based entity. Their 30b MoE models absolutely hit way above their weight when paired with the right harness program, and can be run on "Costco gaming computer" specs when configured correctly in llama.cpp.

Sorry Trump Administration, but while the US has been downloading more ram by throwing data centers at everything and burning up everyone's power and water, China has come out with what's effectively a prototype edge compute capable AI model - regardless of how they built it. And arguably I can tokenmaxx on it just fine at around 30-40 tokens/sec.

And also, ASICs are on the way. Imagine one of those with a heavy hitting model (MoE or otherwise, Qwen or otherwise) installed in a PCIe slot at 10k+ tokens/sec and 75 watts max (maximum wattage deliverable by the PCIe slot alone) for $300-400 USD each.

https://taalas.com/the-path-to-ubiquitous-ai/

ASIC demo here: https://chatjimmy.ai/

Sorry/not sorry to rip this whole thing to shreds. But I'm sick and tired of these inefficient LLMs being produced that seemingly can only be offered by subscription from a data center, when I'm running a full AI stack right now (model and all) on my computer at home on a 750 watt max power supply. Microsoft really needs to get with the picture here and compete more with Qwen instead of just the US/EU entities.

Sincerely, your neighbor down in Tacoma. https://www.youtube.com/watch?v=V9jlo4Ht2YA&t=229s

giancarlostoro 59 days ago

The takeaway is that this is a smaller model that competes with Haiku. I would hope they come out with a "Sonnet" competing model, then Opus. I have been wondering why Microsoft is kind of "sleeping" on offering models they themselves have made on Copilot, maybe it was part of their deal with OpenAI? Not sure.

mdasen 59 days ago

Yes, it's a "smaller" (137B) model that competes with Haiku, but it's basically the performance of Qwen3.6-35B-A3B which is 75% smaller and 98% smaller in terms of active parameters (since it's a mixture of experts model). Microsoft should be comparing its model to good smaller models, not Haiku 4.5.

Qwen-3.6-27b is closer to Claude Opus 4.7 than it is to Haiku 4.5 in a lot of benchmarks - and it's way smaller than Microsoft's new model.

Sure, it competes with Haiku, but it shows how far Microsoft is behind lots of other small models that are available.

stingraycharles 59 days ago

I understand what you’re saying, but I am generally very careful when comparing models and their benchmarks; benchmarks often don’t really match “real world” quality.

yorwba 59 days ago

The technical report https://microsoft.ai/wp-content/uploads/2026/06/main_2026060... has a lot of detail about decontaminating their training data and developing new in-house benchmarks to ensure reliable evaluation. If other models were just overfit to public benchmarks while Microsoft produced something that generalizes better to unseen data, they could've used those in-house benchmarks to argue that point.

Instead, they only do cherry-picked comparisons against Anthropic's small models, and not the full spectrum of competitors.

Without evidence to the contrary, I'll interpret this as just what happens when you're late to the party and insist on doing everything from scratch.

Maybe coaxing reasoning behavior out of their base model without kickstarting it by distilling from existing models provided them with valuable experience that will help improve their future models, or maybe it was an unnecessary waste of time.

fmajid 58 days ago

If their model was trained purely on properly licensed data, the reduced legal liability could be a selling point

IanCal 59 days ago

> 98% smaller in terms of active parameters (since it's a mixture of experts model).

I don’t think that’s right, this flash model is 5B active params. Qwen3.6-35B-A3B is 3B so 40% smaller.
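The arithmetic is easy to check. A minimal sketch, taking the parameter counts from the model names quoted in this thread (137B total / 5B active for MAI-Code-1-Flash, 35B total / 3B active for Qwen3.6-35B-A3B); `percent_smaller` is just an illustrative helper:

```python
def percent_smaller(small: float, large: float) -> float:
    """Return how much smaller `small` is than `large`, as a percentage."""
    return (1 - small / large) * 100


# Parameter counts in billions, as quoted in the thread.
mai_total, mai_active = 137, 5
qwen_total, qwen_active = 35, 3

total_gap = percent_smaller(qwen_total, mai_total)
active_gap = percent_smaller(qwen_active, mai_active)

print(f"total params:  Qwen is {total_gap:.1f}% smaller")
print(f"active params: Qwen is {active_gap:.1f}% smaller")
```

So the "75% smaller" figure for total parameters roughly holds, while the active-parameter gap is 40%.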

minraws 59 days ago

They did release MAI-Thinking-1 to compete with Sonnet. Totally not sure why that isn't at the top here.

ignoramous 58 days ago

Can't yet use MAI-Thinking-1? [0] And no indication of it being made available in GitHub Copilot, either.

[0] Not even here: https://playground.microsoft.ai/

giancarlostoro 59 days ago

Good question, and I missed that entirely!

lostmsu 59 days ago

Compete? It is behind Kimi K2.6, which is in turn way behind Sonnet.

sfifs 58 days ago

Qwen is definitely the model to beat as of mid-2026. I didn't benchmark with SWE since my use cases are OpenClaw [1], but I found both Qwen 3.6 35B A3B and, more impressively, Qwen 3.5 122B A10B starting to be competitive with closed flash models. The NVFP4 quant of the latter is what I'm running now on DGX.

[1] https://srinathh.medium.com/mid-size-local-models-are-now-co...

abustamam 58 days ago

How does qwen compare to deepseek or kimi? I haven't spent much time with qwen but I find deepseek to be mostly comparable to opus for my pet projects. Kimi k2.6 did a lot of stupid stuff and talked to itself a lot "let me do X... Wait, X doesn't make sense because the user explicitly said Y"

Deepseek seems to seek first to understand before going off.

sfifs 57 days ago

Deepseek is too large for me to self-host on Spark. I was actually using Deepseek as my cloud backup and it performed well, but then I read the T&C, which doesn't give data protection guarantees as strong as Google's and Alibaba's. Kimi is again massive, its cloud-hosted APIs are fairly expensive by comparison, and it also has a weak T&C, so I have only benched it but not tested it. In general I found that with OpenClaw it works better to turn Reasoning off.

I think there's possibly value to try fine tuning Qwen 3.5 on my OpenClaw turns log to see if performance improves. The one recent model I haven't tested yet is Nemotron 3 Super which I might bench soon.

sfifs 47 days ago

As an update, turns out Antirez created a brilliant 2-bit quant of Deepseek to fit into 128GB systems along with a custom highly tuned server. I've been running this the last few days and if I turn off envelope on OpenClaw, the performance is brilliant. Still to try with coding harnesses. It's a bit slow compared to the other models but so good that I'm willing to put up with it :-) https://github.com/antirez/ds4

kristjansson 59 days ago

> 137B-A5B

Yeah, not a 5B param model as the earlier title implied!

epolanski 59 days ago

So what other models use less than half of Haiku's tokens while providing higher success rate?

akie 59 days ago

Why is Haiku the benchmark, though? With code generation, don't we primarily care about the quality of the code, not the speed or efficiency at which it's generated?

NitpickLawyer 59 days ago

You would be surprised how much code haiku writes behind the scenes. With the whole 'plan w/ opus, spawn subagents w/ haiku' that cc does. And you'd be surprised how useful the small models can be under some guidance / hand holding. You can daily-drive gpt5-mini and still find it useful. They're not as good as the big ones, obviously, and can't handle a project start-to-finish on their own, but given a well-scoped task, they'll do it just fine.

epolanski 59 days ago

I'm not sure I follow, but I'll give you a very fresh example.

I was implementing a re-print functionality in my warehouse management system.

It took Opus 4.8 high 24m1s and 87k tokens. Took Haiku 6m30s and 41k tokens.

After that time I had to provide (minor) adjustments to both. But Haiku allowed me to iterate faster. Code quality for that somewhat trivial use case was similar.

Actually, I would even say that Opus provided a subpar solution: instead of fixing an issue where the carrier label PDF wasn't saved as the state machine progressed to the latest step, it went for a much more complex solution of regenerating those from scratch. Which is also wrong, as it was de facto booking the carriers twice for the same order.

Haiku simply added another field on the terminal state that carried the already generated urls.

I don't think it's a good idea to default to highest effort/bigger model without taking into account the time it takes and the task complexity.

Imho we should experiment rather than assume that what the rest of the community does is best practice.
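For scale, the two runs quoted above can be put side by side. A minimal sketch using only the figures from this comment (24m1s / 87k tokens for Opus, 6m30s / 41k tokens for Haiku); `to_seconds` is a hypothetical helper, and a single task is of course one data point, not a benchmark:

```python
import re


def to_seconds(duration: str) -> int:
    """Parse a duration written like '24m1s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", duration)
    if match is None:
        raise ValueError(f"unrecognised duration: {duration!r}")
    return int(match[1]) * 60 + int(match[2])


# Wall-clock time and token counts as reported in the comment.
opus = {"seconds": to_seconds("24m1s"), "tokens": 87_000}
haiku = {"seconds": to_seconds("6m30s"), "tokens": 41_000}

time_ratio = opus["seconds"] / haiku["seconds"]
token_ratio = opus["tokens"] / haiku["tokens"]

print(f"Opus took {time_ratio:.1f}x the wall-clock time")
print(f"Opus used {token_ratio:.1f}x the tokens")
```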

vinzenzu 59 days ago

Totally agree. I've been using cheap Chinese open-source models via OpenCode Go, and they are faster, cheaper and in my experience arrive at the solution quicker because they are more pragmatic.

Yesterday Codex was making a big issue out of a new module that was upgraded in our cluster and because of which the same SSH key would be "regenerated" by Terraform. No big deal, it just truncates a newline at the end of the SSH key and it works all the same. But not being aware that this, as an example, is unimportant can cost a lot more time than using the big models saves.

easygenes 59 days ago

While I agree directionally, I'll caveat that "cost per token" != "cost per task". In the case of Qwen3.6 it tends to think 1.6x more than Haiku, so the cost of Haiku on the same tasks tends to only be about double. More detail from comparing their Artificial Analysis metrics:

  Qwen3.6-35B-A3B   vs   Claude Haiku 4.5
    reasoning mode · AA Intelligence Index v4.0
  
  46.0 ┤   ↖ better — cheaper · smarter · faster
       │
       │
  44.0 ┤     ╭─────╮
       │     │  ●  │ Qwen3.6-35B-A3B
       │     ╰─────╯
  42.0 ┤
       │
       │
  40.0 ┤
       │
       │
  38.0 ┤                                       ╭───╮
       │                      Claude Haiku 4.5 │ ○ │
       │                                       ╰───╯
  36.0 ┤
       └┬─────────┬─────────┬─────────┬─────────┬────────┬
        $200    $300      $400      $500      $600    $700
  
    x → cost to run the index (USD)        lower is better
    y → AA intelligence index              higher is better
  
    bubble area = output speed (tokens / sec)
          ╭─────╮                  ╭───╮
          │  ●  │ Qwen ~196 t/s    │ ○ │ Haiku ~93 t/s
          ╰─────╯                  ╰───╯
  
    ┌─────────────────────┬──────────┬──────────┬───────────┐
    │ model               │ AA index │ run cost │ out speed │
    ├─────────────────────┼──────────┼──────────┼───────────┤
    │ Qwen3.6-35B-A3B    ●│   43.5   │   $280   │  196 t/s  │
    │ Claude Haiku 4.5   ○│   37.1   │   $620   │   93 t/s  │
    └─────────────────────┴──────────┴──────────┴───────────┘


    COST PER TOKEN   ≠   COST PER TASK  
    output tokens per index run:
       Haiku 4.5    87.3M   (79.3M reasoning + 8.0M answer)
       Qwen3.6     143.2M   (131.7M reasoning + 11.5M answer)
       → Qwen emits 1.64× more output
  
    ── output speed (tokens / sec) ──────────  raw rate · higher = faster
       Qwen3.6     100%   ~196 t/s
       Haiku 4.5   ~47%   ~93 t/s
                                                  → Qwen ~2.1× faster per token
  
          ╎   1.64× more tokens  <  2.1× faster rate
          ▼
  
    ── solution speed (per finished answer) ──  higher = faster
       Qwen3.6     100%
       Haiku 4.5   ~78%
                                                  → Qwen ~1.3× FASTER to a solution
  
    SCORECARD
                            intelligence    cost / task     speed to solution
     Qwen3.6-35B-A3B        43.5            $280            ~1.3× faster 
     Claude Haiku 4.5       37.1            $620            (slower)
  
     → Qwen wins all three. The reasoning blow-up (1.64×) is smaller than
       the raw-speed edge (2.1×), so Qwen stays ahead per task.
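
The ratios in the chart can be reproduced from its own table. A minimal sketch, using only the figures quoted above and assuming that time per run is simply output tokens divided by output rate (it ignores input tokens, time to first token, and any parallelism):

```python
# Figures copied from the tables in the comment above.
runs = {
    "Qwen3.6-35B-A3B": {"output_tokens_m": 143.2, "tokens_per_sec": 196, "run_cost_usd": 280},
    "Claude Haiku 4.5": {"output_tokens_m": 87.3, "tokens_per_sec": 93, "run_cost_usd": 620},
}
qwen, haiku = runs["Qwen3.6-35B-A3B"], runs["Claude Haiku 4.5"]

# How many more output tokens Qwen emits for the same index run.
token_ratio = qwen["output_tokens_m"] / haiku["output_tokens_m"]
# Raw per-token speed advantage.
rate_ratio = qwen["tokens_per_sec"] / haiku["tokens_per_sec"]
# Net speed advantage per finished run: faster rate, discounted by extra tokens.
time_ratio = rate_ratio / token_ratio
# Cost of the whole run, which is what "cost per task" refers to here.
cost_ratio = haiku["run_cost_usd"] / qwen["run_cost_usd"]

print(f"Qwen emits {token_ratio:.2f}x more output tokens")
print(f"Qwen is {rate_ratio:.1f}x faster per token")
print(f"Qwen is {time_ratio:.1f}x faster per finished run")
print(f"Haiku costs {cost_ratio:.1f}x more per run")
```

The point of separating `token_ratio` from `rate_ratio` is that a per-token price or speed advantage only carries over to the task level after dividing by how many more tokens the model needs.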

HarHarVeryFunny 58 days ago

How did you get that nicely formatted graph and table in your post ?!

Krysoph 58 days ago

> Text after a blank line that is indented by two or more spaces is formatted as code.

https://news.ycombinator.com/formatdoc

  crimes ↑
         │
   10.0  ┤                                           ● Airport burger
         │                                      ╭──────────────╮
    8.0  ┤                                      │  theft arc   │
         │                                      ╰──────────────╯
    6.0  ┤                         ● Five Guys
         │
    4.0  ┤              ● Food truck burger
         │
    2.0  ┤      ● McBurger
         │
    0.0  ┤ ● Homemade burger
         │
         └───────┬─────────┬─────────┬─────────┬─────────→ price
                $2        $8        $14       $22       $38

  ┌────────────────────┬────────┬──────────────┬────────────────────┐
  │ burger             │ price  │ crime index  │ expected behavior  │
  ├────────────────────┼────────┼──────────────┼────────────────────┤
  │ Homemade burger    │   $2   │          0.0 │ law-abiding citizen│
  │ McBurger           │   $6   │          1.4 │ steals extra napkin│
  │ Food truck burger  │  $11   │          3.1 │ lies about hunger  │
  │ Five Guys          │  $18   │          6.2 │ financial crime    │
  │ Airport burger     │  $34   │          9.7 │ enters villain arc │
  └────────────────────┴────────┴──────────────┴────────────────────┘

  conclusion: burger inflation is a gateway condiment

HarHarVeryFunny 58 days ago

Thanks, so in this case the value of "code formatting" is using a fixed-width font?

The next question is where did the "ASCII-art" graph and table come from? Are there sites to generate these?
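
One way to get such tables without a website is a few lines of script. A minimal sketch, not necessarily how the posts above were made: `box_table` is a hypothetical helper that draws a table with Unicode box-drawing characters and indents every line by two spaces, which is what the HN formatting rule quoted above needs to render it as code.

```python
def box_table(headers, rows, indent=2):
    """Render rows as a box-drawing table, indented for HN code formatting."""
    cells = [[str(c) for c in row] for row in [headers, *rows]]
    # Each column is as wide as its longest cell.
    widths = [max(len(row[i]) for row in cells) for i in range(len(headers))]

    def rule(left, mid, right):
        # Horizontal border with one space of padding on each side of a cell.
        return left + mid.join("─" * (w + 2) for w in widths) + right

    def line(row):
        return "│" + "│".join(f" {c.ljust(w)} " for c, w in zip(row, widths)) + "│"

    out = [rule("┌", "┬", "┐"), line(cells[0]), rule("├", "┼", "┤")]
    out += [line(r) for r in cells[1:]]
    out.append(rule("└", "┴", "┘"))
    pad = " " * indent
    return "\n".join(pad + text for text in out)


table = box_table(
    ["model", "AA index", "run cost"],
    [["Qwen3.6-35B-A3B", 43.5, "$280"], ["Claude Haiku 4.5", 37.1, "$620"]],
)
print(table)
```

Paste the printed output after a blank line and the fixed-width font keeps the columns aligned.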