by obblekk 201 days ago
80% on swebench verified is incredible. a year ago the best model was at ~30%. i wonder if we'll soon have a convincingly superhuman coding capability (even in a narrow field like kernel optimization).

this is the most interesting time for software tools since compilers and static typechecking were invented.


Last year’s models were at 50-60% on SWE-bench Verified, actually.
I see 25-29% here https://www.swebench.com/viewer.html for models released in Nov 2024, albeit not verified. GPT-4o (Aug 2024) was at 33% on SWE-bench Verified.

Important point, because people have a bias toward underestimating the speed of AI progress.

Do you people think nobody calls your bluff?

Here’s the launch card for Sonnet 3.5 from a year and a month ago. Guess the number. OK, I’ll tell you: 49.0%. So yeah, the comment you replied to was not really off.

https://www.anthropic.com/news/3-5-models-and-computer-use