by jiwidi 61 days ago
See the original Opus 4.6 sitting at 16% hallucination and the retest on the 12th of April at 33%.

They definitely must be doing some quantization or optimization to meet demand, otherwise why would model performance degrade this much? It's been crazy for me personally

1 comment

I'm the last person to defend any of these companies, but it's not a retest. The set of tasks is clearly different, and results for the original tasks are nearly identical. They differ on one task where it previously had fabrications = 0 and now has 1, which dropped the score from 100% to 86%.

Combining multiple tests on the same leaderboard like this is nonsense; there should be a separate leaderboard for the new tasks where every model is tested again.

Putting it on the original leaderboard as "Opus 4.6 (April 12)" is so obviously inappropriate that it smells like deception. You could say that the leaderboard is hallucinated.