| HN Mirror



	by obblekk 249 days ago
	80% on swebench verified is incredible. a year ago the best model was at ~30%. i wonder if we'll soon have a convincingly superhuman coding capability (even in a narrow field like kernel optimization). this is the most interesting time for software tools since compilers and static typechecking were invented.


quantumHazer 249 days ago

Last year’s models were at 50-60% on SWE-bench Verified, actually


obblekk 249 days ago

I see 25-29% here https://www.swebench.com/viewer.html for models released in Nov 2024 albeit not verified. gpt4o (Aug 2024) was 33% for swe bench verified.

Important point because people have a bias to underestimate the speed of ai progress.


tymscar 248 days ago

Do you people think nobody calls your bluff?

Here’s the launch card of the sonnet 3.5 from a year and a month ago. Guess the number. Ok, I’ll tell you: 49.0%. So yeah, the comment you replied to was not really off.

https://www.anthropic.com/news/3-5-models-and-computer-use
