| HN Mirror


by ZeroCool2u 102 days ago

Bit concerning that we see in some cases significantly worse results when enabling thinking. Especially for Math, but also in the browser agent benchmark.

Not sure if this is more concerning for the test time compute paradigm or the underlying model itself.

Maybe I'm misunderstanding something though? I'm assuming 5.4 and 5.4 Thinking are the same underlying model and that's not just marketing.

3 comments

oersted 102 days ago

I believe you are looking at GPT 5.4 Pro. It's confusing in the context of subscription plan names, Gemini naming and such. But they've had the Pro version of the GPT 5 models (and I believe o3 and o1 too) for a while.

It's the one you have access to with the top ~$200 subscription, and it's available through the API for a MUCH higher price ($30/$180 vs $2.50/$15 for regular 5.4, per 1M tokens), but the performance improvement is marginal.
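To put the quoted prices in concrete terms, here is a rough sketch. It assumes each pair is the input/output price per 1M tokens, with $2.50/$15 for 5.4 and $30/$180 for 5.4 Pro; the token counts are made up for illustration.

```python
# Prices quoted in this comment, in USD per 1M tokens.
# Assumption: each pair is (input price, output price).
PRICES = {
    "5.4": (2.50, 15.00),
    "5.4 Pro": (30.00, 180.00),
}


def request_cost(model, input_tokens, output_tokens):
    """Cost in USD of one request at the quoted per-1M-token prices."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical workload: 10k input tokens and 2k output tokens per request.
base = request_cost("5.4", 10_000, 2_000)
pro = request_cost("5.4 Pro", 10_000, 2_000)
print(f"5.4: ${base:.3f}  Pro: ${pro:.3f}  ratio: {pro / base:.0f}x")
```

Both the input and the output price differ by the same factor (30 / 2.5 = 180 / 15 = 12), so under these numbers Pro costs 12x as much for the same token counts, whatever the input/output mix. If Pro also produces more output tokens per request, the real gap would be larger.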

Not sure what it is exactly, I assume it's probably the non-quantized version of the model or something like that.


logicchains 102 days ago

>It's the one you have access to with the top ~$200 subscription, and it's available through the API for a MUCH higher price ($30/$180 vs $2.50/$15 for regular 5.4, per 1M tokens), but the performance improvement is marginal.

The performance improvement isn't marginal if you're doing something particularly novel/difficult.


nsingh2 102 days ago

From what I've read online, it's not necessarily an unquantized version; it seems to use longer reasoning traces and run multiple reasoning traces at once. Probably overkill for most tasks.
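For anyone wondering what "multiple reasoning traces at once" could look like, here is a toy sketch of one common approach: sample several independent traces and take a majority vote over their final answers. This is only an illustration. How the Pro model actually combines traces isn't stated here, and `sample_trace` is a hypothetical stand-in, not a real API call.

```python
import random
from collections import Counter


def sample_trace(question, rng):
    """Hypothetical stand-in for one reasoning trace; returns a final answer.

    Toy behavior: the right answer ("42") 60% of the time, otherwise one of
    three wrong answers chosen at random.
    """
    return "42" if rng.random() < 0.6 else rng.choice(["41", "43", "44"])


def majority_vote(answers):
    """Return the most common answer and the share of traces that gave it."""
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)


def answer_with_parallel_traces(question, n_traces, seed=0):
    """Sample n_traces independent traces and vote on their final answers."""
    rng = random.Random(seed)
    answers = [sample_trace(question, rng) for _ in range(n_traces)]
    return majority_vote(answers)


answer, share = answer_with_parallel_traces("toy question", n_traces=25)
print(answer, share)
```

The trade-off is visible even in the toy: every extra trace costs another full generation, which would fit the much higher price, while the vote only helps when single traces are often wrong.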


ZeroCool2u 102 days ago

Yup, that was it. Didn't realize they're different models. I suppose naming has never been OpenAI's strong suit.


highfrequency 102 days ago

Can you be more specific about which math results you are talking about? Looks like a significant improvement on FrontierMath, especially for the Pro model (the most inference-time compute).


ZeroCool2u 102 days ago

FrontierMath, GPQA Diamond, and BrowseComp are the benchmarks I noticed this on.


csnweb 102 days ago

Are you maybe comparing the Pro model to the non-Pro model with thinking? Granted, it’s a bit confusing, but the Pro model is 10 times more expensive and probably much larger as well.


ZeroCool2u 102 days ago

Ah yes, okay that makes more sense!


andoando 102 days ago

The thinking models are additionally trained with reinforcement learning to produce chain-of-thought reasoning.
