	by georgewsinger 546 days ago
	Did anyone else notice that o3-mini's SWE-bench score dropped from 61% in the leaked System Card earlier today to 49.3% in this blog post, which puts o3-mini back in line with Claude on real-world coding tasks? Am I missing something?

anothermathbozo 546 days ago

I think this is with and without "tools." They explain it in the system card:

> We evaluate SWE-bench in two settings:
>
> *• Agentless*, which is used for all models except o3-mini (tools). This setting uses the Agentless 1.0 scaffold, and models are given 5 tries to generate a candidate patch. We compute pass@1 by averaging the per-instance pass rates of all samples that generated a valid (i.e., non-empty) patch. If the model fails to generate a valid patch on every attempt, that instance is considered incorrect.

> *• o3-mini (tools)*, which uses an internal tool scaffold designed for efficient iterative file editing and debugging. In this setting, we average over 4 tries per instance to compute pass@1 (unlike Agentless, the error rate does not significantly impact results). o3-mini (tools) was evaluated using a non-final checkpoint that differs slightly from the o3-mini launch candidate.
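
For what it's worth, here is how I read that pass@1 description, as a minimal Python sketch. The function names and the `(is_valid, passed)` sample layout are my own assumptions, not anything from the system card, and the real harness may differ in details.

```python
from statistics import mean

def instance_pass_rate(samples):
    """Pass rate for one benchmark instance.

    `samples` is a list of (is_valid, passed) tuples, one per try.
    Only tries that produced a valid (non-empty) patch are counted.
    If no try produced a valid patch, the instance scores 0.0.
    """
    valid_results = [passed for is_valid, passed in samples if is_valid]
    if not valid_results:
        return 0.0
    return sum(valid_results) / len(valid_results)

def pass_at_1(instances):
    """Average the per-instance pass rates over all instances."""
    return mean(instance_pass_rate(samples) for samples in instances)
```

If every sample is marked valid, the same function covers the o3-mini (tools) setting as described, which simply averages over its 4 tries per instance.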

Bjorkbat 546 days ago

So am I to understand that they used their internal tooling scaffold on the o3-mini (tools) results only? Because if so, I really don't like that.

While it's nonetheless impressive that they scored 61% on SWE-bench with o3-mini combined with their tool scaffolding, the Agentless comparison with other models is less impressive: 40% vs 35% for o1-mini, if you look at the graph on page 28 of their system card PDF (https://cdn.openai.com/o3-mini-system-card.pdf).

It just feels like data manipulation to suggest that o3-mini is much more performant than past models. A fairer picture would still paint a performance improvement, but it would look less exciting and more incremental.

Of course the real improvement is cost, but still, it kind of rubs me the wrong way.

pockmarked19 546 days ago

YC usually says “a startup is the point in your life where tricks stop working”.

Sam Altman is somehow finding this out now, the hard way.

Most paying customers will find out within minutes whether the models can serve their use case, a benchmark isn’t going to change that except for media manipulation (and even that doesn’t work all that well, since journalists don’t really know what they are saying and readers can tell).

galaxyLogic 545 days ago

My guess is this cheap mini-model comes out now because DeepSeek very recently shook the stock market with its cheap price and relatively good performance.

IanCal 545 days ago

o3-mini has been coming for a while and, IIRC, was "a couple of weeks" away a few weeks ago, before R1 hit the news.

georgewsinger 546 days ago

Makes sense. Thanks for the correction.

jakereps 546 days ago

The caption on the graph explains.

> including with the open-source Agentless scaffold (39%) and an internal tools scaffold (61%), see our system card.

I have no idea what an "internal tools scaffold" is but the graph on the card that they link directly to specifies "o3-mini (tools)" where the blog post is talking about others.

DrewHintz 546 days ago

I'm guessing an "internal tools scaffold" is something like Goose: https://github.com/block/goose

Instead of just generating a patch (copilot style), it generates the patch, applies the patch, runs the code, and then iterates based on the execution output.
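
A rough sketch of that generate, apply, run, iterate loop, to make the contrast with one-shot patch generation concrete. This is only an illustration: `generate_patch` and `apply_and_run` are hypothetical callables, and nothing here reflects Goose's or OpenAI's actual implementation.

```python
from typing import Callable, Optional, Tuple

def iterate_on_patch(
    generate_patch: Callable[[str], str],               # hypothetical: prompt -> patch text
    apply_and_run: Callable[[str], Tuple[bool, str]],   # hypothetical: patch -> (tests_ok, output)
    task: str,
    max_rounds: int = 4,
) -> Optional[str]:
    """Return the first patch whose run succeeds, or None if all rounds fail."""
    feedback = ""
    for _ in range(max_rounds):
        # Ask for a patch, including any output from the previous failed run.
        patch = generate_patch(task + feedback)
        ok, output = apply_and_run(patch)
        if ok:
            return patch
        # Feed the execution output back into the next attempt.
        feedback = "\n\nPrevious attempt failed with:\n" + output
    return None
```

A copilot-style tool would stop after the first `generate_patch` call; the scaffold's advantage, if this guess is right, comes from the feedback step.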

logicchains 546 days ago

Maybe they found a need to quantize it further for release, or lobotomise it with more "alignment".

ben_w 546 days ago

> lobotomise

Anyone can write very fast software if you don't mind it sometimes crashing or having weird bugs.

Why do people try to meme as if AI is different? It has unexpected outputs sometimes, and getting it to not do that is 50% "more alignment" and 50% "hallucinate less".

Just today I saw someone get the Amazon bot to roleplay furry erotica. Funny, sure, but it's still obviously a bug that a *sales bot* would do that.

And given these models do actually get stuff wrong, is it really incorrect for them to refuse to help with things that might be dangerous if the user isn't already skilled, like Claude in this story about DIY fusion? https://www.corememory.com/p/a-young-man-used-ai-to-build-a-...

bee_rider 546 days ago

If somebody wants their Amazon bot to role play as an erotic furry, that’s up to them, right? Who cares. It is working as intended if it keeps them going back to the site and buying things I guess.

I don’t know why somebody would want that, seems annoying. But I also don’t expect people to explain why they do this kind of stuff.

ben_w 546 days ago

It's still a bug. Not really working as intended — it doesn't sell anything from that.

A very funny bug, but a bug nonetheless.

And given this was shared via screenshots, it was done for a laugh.

thrwthsnw 545 days ago

Who determines who gets access to what information? The OpenAI board? Sam? What qualifies as dangerous information? Maybe it’s dangerous to allow the model to answer questions about a person. What happens when limiting information becomes a service you can sell? For the right price anything can become too dangerous for the average person to know about.

ben_w 545 days ago

> What qualifies as dangerous information?

The reports are public, and if you don't feel like reading them because they're too long and thorough in their explanations of what and why you can always put them into an AI and ask it to summarise them for you.

OpenAI is allowed to unilaterally limit the capability of their own models, just like any other software company can unilaterally limit the performance of their own software.

And they still are even when they're just blatantly wrong or even just lazy; it's not like people complain about Google "lobotomising" their web browsers for no longer supporting Flash or Java applets.

Rastonbury 546 days ago

They are implying the release was rushed and they had to reduce the functionality of the model in order to make sure it did not teach people how to make dirty bombs.

stavros 545 days ago

The problem is that they don't make the LLM better at instruction following, they just make it unable to produce furry erotica even if Amazon wants it to.

AbstractH24 545 days ago

> Anyone can write very fast software if you don't mind it sometimes crashing or having weird bugs.

Isn’t that exactly what VCs want?

ben_w 545 days ago

I doubt it.

The advice I've always been given in (admittedly small) business startup sessions was "focus on quality rather than price because someone will always undercut you on price".

The models are in a constant race on both price and quality, but right now they're so cheap that paying for the best makes sense for any "creative" task (like writing software, even if only to reduce the number of bugs the human code reviewer needs to fix), while price sensitivity only matters for the grunt-work classification tasks (such as "based on comments, what is the public response to this policy?").

kkzz99 546 days ago

Or the number was never real to begin with.