	by lordmauve 25 days ago
	Given DeepSWE just blew apart the SWE-Bench Pro benchmark and handed a 14-point lead to GPT-5.5, it looks pretty bad that they've listed SWE-Bench first in the model release and no DeepSWE. Like, this isn't obviously an answer. Or maybe it is, but publish the DeepSWE numbers so we can see for ourselves.

2 comments

phainopepla2 25 days ago

I'm highly skeptical of DeepSWE. It rates GPT-5.4-mini as three times better than deepseek-v4-pro, but every time I use GPT-5.4-mini I find that it completely sucks at following directions.

gck1 25 days ago

Yeah, I share the same sentiment. I have yet to find a task where gpt-5.4-mini isn't bordering on unusable.

lordmauve 25 days ago

I don't know if DeepSWE is genuinely a good benchmark. It's more important that their analysis demolished the validity of SWE-Bench Pro, objectively: it is being mismarked.

I think that buys enough credibility to propose an alternative.

I think there's a case to answer if Anthropic models underperform on a novel benchmark. I'd like to see more novel benchmarks to get a clearer picture.

sourcecodeplz 25 days ago

It's the extra-high thinking: on artificialanalysis.ai it uses 240M tokens vs. 40M for GPT-5.4/5. Not worth it even at the low price.

mordae 25 days ago

This is a terrible benchmark. It literally tests the models on their ability to track shifting line numbers. If they cannot keep up, no amount of abstract reasoning can redeem them.

lordmauve 25 days ago

Where did you get that idea? It uses mini-swe-agent, same as SWE-Bench.

https://github.com/datacurve-ai/deep-swe
