by dns_snek 75 days ago
I'm the last person to defend any of these companies, but it's not a retest. The set of tasks is clearly different, and the results for the original tasks are nearly identical. They differ on one task, where it previously had fabrications = 0 and now has 1, which dropped the score from 100% to 86%.

Combining multiple tests on the same leaderboard like this is nonsense. There should be a separate leaderboard for the new tasks, where every model is tested again.

Putting it on the original leaderboard as "Opus 4.6 (April 12)" is so obviously inappropriate that it smells like deception. You could say that the leaderboard is hallucinated.