| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by 7moritz7 310 days ago

Qwen3 is substantially better in my local testing. As in, adheres to the prompt better (pretty much exactly for the 32B parameter variant, very impressive) and is more organic sounding.

In simplebench gpt-oss (120 bn) flopped hard so it doesn't appear particularly good at logical puzzles either.

So presumably, this comes down to...

- training technique or data

- dimension

- lower number of large experts vs higher number of small experts

5 comments

jszymborski 310 days ago

If I had to make a guess, I'd say this has much, much less to do with the architecture and far more to do with the data and training pipeline. Many have speculated that gpt-oss has adopted a Phi-like synthetic-only dataset and focused mostly on gaming metrics, and I've found the evidence so far to be sufficiently compelling.

7moritz7 310 days ago

That would be interesting. I've been a bit sceptical of the entire strategy from the beginning. If oss was actually as good as o3 mini and in some cases o4 mini outside benchmarks, that would undermine openai's api offer for gpt 5 nano and maybe mini too.

Edit: found this analysis, it's on the HN frontpage right now

> this thing is clearly trained via RL to think and solve tasks for specific reasoning benchmarks. nothing else.

https://x.com/jxmnop/status/1953899426075816164

CuriouslyC 310 days ago

The strategy of Phi isn't bad, it's just not general. It's really a model that's meant to be fine tuned, but unfortunately fine tuning tends to shit on RL'd behavior, so it ended up not being that useful. If someone made a Phi style model with an architecture that was designed to take knowledge adapters/experts (i.e. small MoE model designed to get separately trained networks plugged into them with routing updates via special LoRA) it'd actually be super useful.

adastra22 310 days ago

The Phi strategy is bad. It results in very bad models that are useless in production, while gaming the benchmark to appear like it is actually able to do something. This is objectively bad.

CuriouslyC 310 days ago

I like the idea of having a _HIGHLY_ unopinionated base model that's just good at basic logic and instruction following that I can fine tune to my use case. Sadly, full fine tuning tends to make models derpy, and LoRAs are limited in terms of what they can achieve.

adastra22 310 days ago

That seems unrelated? I think we are talking about past each other. Phi was trained on purely synthetic data derived from emulating the benchmark suite. Not surprisingly, this resulted in state of the art scores. And a model that was 100% useless at anything other than making the benchmark number go up.

johnisgood 310 days ago

Is there an URL to the post itself on somewhere else?

unstatusthequo 310 days ago

Yes. I tried to ask oss-gpt to ask me a riddle. The response was absurd. Came up with a nonsensical question, then told me the answer. The answer was a four letter “word” that wasn’t actually a real word.

“What is the word that starts with S, ends with E, and contains A? → SAEA”

Then when I said that’s not a word and you gave me the answer already, no fun, it said

“I do not have access to confirm that word.”

verisimi 310 days ago

lol. The answer it gave doesn't even end in an 'E'.

threeducks 310 days ago

FWIW, I asked gpt-oss-120b this question 10 times and the answer was always "sauce", "sane" or "sale". I also tried different temperatures (from 0 to 1), which did not seem to have an effect on the correctness of the answer.

EDIT: I now have also questioned the smaller gpt-oss-20b (free) 10 times via OpenRouter (default settings, provider was AtlasCloud) and the answers were: sage, sane, sane, space, sane, sane, sane, sane, space, sane.

You are either very unlucky, your configuration is suboptimal (weird system prompt perhaps?) or there is some bug in whichever system you are using for inference.

yunusabd 310 days ago

GP asked the model to _create_ a riddle, not solve a given one.

threeducks 310 days ago

Yes, but the odds of getting GPT-OSS to respond with that riddle are pretty low and it is not necessary to demonstrate whether the LLM can answer the riddle correctly.

vidarh 310 days ago

They said it provided the answer when it created the riddle. They didn't question itd ability to solve it.

roscas 310 days ago

"The word is SAME." for qwen3-coder and it took like 10 seconds to write all the thinking because it writes it down, but processing was 0 seconds.

"I need to find a word that starts with S, ends with E, and contains A.

Let me think through this step by step:

    Starts with S
    Ends with E
    Contains A somewhere in the middle

One word that fits this pattern is "SAGE" - but that only has 4 letters and would be S-A-G-E, which doesn't work because it doesn't contain A in the middle.

Wait, let me reconsider. The word "SAGE" actually works if I read it correctly:

    Starts with S
    Ends with E  
    Contains A

Actually, "SAGE" is S-A-G-E, so it starts with S and ends with E, but it contains A in the second position.

Let me think of other possibilities:

    "SQUARE" - starts with S, ends with E, contains A (in the middle)
    "SALE" - starts with S, ends with E, contains A
    "SAME" - starts with S, ends with E, contains A

Wait, "SAME" is S-A-M-E - that works! It starts with S, ends with E, and contains A.

The word is SAME. "

bee_rider 310 days ago

This is tangential because the task was to come up with the riddle, not solve it.

But, do reasoning models usually do this poorly?

It comes up with a valid solution, SAGE, then disqualifies it for incomprehensible reasons.

Then it discovers that SAGE works if it “reads it carefully.” But then seems to disqualify it(?), or at least goes to list other words for some reason.

Then it comes up with SAME, a word… with exactly the same shape as SAGE, just swapped out the irrelevant letter.

What is going on here? Is it programmed to constantly second-guess itself to make it better at finding weaknesses to its answers to harder riddles? But since it doesn’t know how to accept a good answer, it seems like it is just rolling the dice and then stopping at a random point.

I guess it is technically right, but the logic is a total mess.

yorwba 309 days ago

The model isn't explicitly programmed to constantly second-guess itself, but when you do reinforcement learning with verifiable rewards (RLVR) where only the final answer is verified, even completely nonsensical reasoning can accidentally be rewarded if it gives correct results often enough.

E.g. if the model can generate multiple candidate solutions that are all equally likely (or unlikely) to be correct, it doesn't matter whether you stop at the first one or keep going until a random later one. But if the model can pick the correct solution from multiple candidates better than choosing uniformly at random, generating more candidates becomes an advantage, even if it sometimes results in discarding a correct solution in favor of another one.

adastra22 310 days ago

He was asking the llm to come up with the riddle.

faangguyindia 310 days ago

this is exactly why strongest model gonna lose out to weaker models if the later ones have more data

for example, i was using deep seek webui and getting decent on point answers but it simply does not have latest data.

So, while Deep Seek R1 might be better model than Grok3 or even Grok4, it not having access to "twitter data" basically puts it behind.

Same is case with OpenAI, if OpenAI has access to fast data from github, it can help with bugfixs which claude/gemini2.5 pro can't.

model can be smarter but if it does not have the data to base its inference upon it's useless.

fspeech 310 days ago

On the open source library part, you can ask DeepWiki the questions yourself and feed the answers to the LLMs by hand. DeepWiki gives you high quality answers because they are grounded in code and you can check the veracity yourself.

omneity 310 days ago

Qwen3 32B is a dense model, it uses all its parameters all the time. GPT OSS 20B is a sparse MoE model. This means it only uses a fraction (3.6B) at a time. It’s a tradeoff that makes it faster to run than a dense 20B model and much smarter than a 3.6B one.

In practice the fairest comparison would be to a dense ~8B model. Qwen Coder 30B A3B is a good sparse comparison point as well.

bee_rider 309 days ago

Tangential question from an outsider:

When people talk about sparse or dense models, are they spare or dense matrices in the conventional numerical linear algebra sense? (Something like a csr matrix?)

selcuka 310 days ago

> GPT OSS 20B is a sparse MoE model. This means it only uses a fraction (3.6B) at a time.

They compared it to GPT OSS 120B, which activates 5.1B parameters per token. Given the size of the model it's more than fair to compare it to Qwen3 32B.

Mars008 310 days ago

You call it fair? 32 / 5.1 > 6, it's takes 6 times more to compute each token. Put it other way, Qwen3 32B is 6 times slower than GPT OSS 120B.

kgeist 310 days ago

>Qwen3 32B is 6 times slower than GPT OSS 120B.

Only if 120B fits entirely in the GPU. Otherwise, for me, with a consumer GPU that only has 32 GB VRAM, gpt-oss 120B is actually 2 times slower than Qwen3 32B (37 tok/sec vs. 65 tok/sec)

selcuka 310 days ago

We are talking about accuracy, though. I don't see the point of MoE if a 120B MoE model is not as accurate as even a 32B model.

littlestymaar 310 days ago

I've read many times that MoE models should be comparable to dense models with a number of parameters equal to the geometric mean of the MoE's total number of parameters and active ones.

In the case of gpt-oss 120B that would means sqrt(5*120)=24B.

Mars008 309 days ago

Not sure there is on formula. Because there are two different cases:

1) performance constrained. like NVidia Spark with 128GB or AGX with 64GB.

2) memory constrained. like consumers' GPUs.

In first case MoE is clear win. They fit and run faster. In second case dense models will produce better results. And if performance in token/sec is acceptable then they are better choice.

selcuka 309 days ago

> In the case of gpt-oss 120B that would means sqrt(5*120)=24B.

That's actually in line with what I had (unscientifically) expected. Claude Sonnet 4 seems to agree:

> The most accurate approach for your specific 120B MoE (5.1B active) would be to test it empirically against dense models in the 10-30B range.

kgeist 309 days ago

I've read that the formula is based on the early Mistral models and does not necessarily reflect what's going on nowadays.

BoorishBears 310 days ago

MoE expected performance = sqrt(active heads * total parameter count)

sqrt(120*5) ~= 24

GPT-OSS 120B is effectively a 24B parameter model with the speed of a much smaller model

faangguyindia 310 days ago

yesterday, i signed up for qwen3-coder-plus. It fails 4/10 "diff" edit format in various code editing tools i use.

Gemini Pro 2.5 with diff fenced edit format, rarely fails. So i don't see this Qwen3 hype unless i am using wrong edit format, can anyone tell me which edit format will work better with Qwen3?

https://aider.chat/docs/more/edit-formats.html

eurekin 309 days ago

I'm running a30-a3b-instruct q6 quant on exllamav2 and checked few simple tasks in roo and cline. Prompt adherence, tool calling and file changing worked flawlessly

faangguyindia 309 days ago

okay turns out i was using it in aider with wrong edit format in editor mode, i switched to editor edit format it has not failed so far.

wickedsight 309 days ago

Maybe I'm doing something wrong, but in my testing with Roo and Qwen3-Coder-30B via MLX, it constantly ends up in loops and often doesn't manage to finish editing a file, leaving it half finished.

If I give it really simple, straight forward tasks it works quite nice though.

cranberryturkey 310 days ago

qwen3 is slow though. i used it. it worked, but it was slow and lacking features.

kgeist 310 days ago

On my RTX 5090 with llama.cpp:

gpt-oss 120B - 37 tok/sec (with CPU offloading, doesn't fit in the GPU entirely)

Qwen3 32B - 65 tok/sec

Qwen3 30B-A3B - 150 tok/sec

(all at 4-bit)

xfalcox 310 days ago

Qwen 3 is not slow by any metrics.

Which model, inference software and hardware are you running it on?

The 30BA3B variant flies on any GPU.

SchemaLoad 310 days ago

GPT-OSS is slow too. Gemma3 gives me better results and runs faster.