
No individual Kaggle solution achieved 81%; that result came from an ensemble of models: https://x.com/fchollet/status/1865865271728390515

In my (possibly flawed) interpretation, o3's scores look like an achievement because a single model attained them, but the benchmark itself needs refinement before it can claim to measure AGI as it set out to, since one can brute-force one's way to similar results.