Hacker News
by perrygeo 479 days ago
The solution moving forward has to be private benchmark suites. I could see teams investing in their own set of programming challenges and periodically re-evaluating them - similar to how we would construct sets of live interview questions for candidates and qualitatively assess their ability.

It's vital that the suite isn't leaked, that it's fit for purpose, and that the results are manually assessed. These general-purpose public benchmarks, built on questionable metrics, are effectively worthless for assessing real programming skill.

Case in point, as others have mentioned here: Claude scores modestly on these benchmarks but does vastly better than the alternatives in practice. I don't trust Claude fully, but I trust it far more than OpenAI's models; it's not even close. That real-world performance advantage is not reflected in any of these benchmarks.