by famouswaffles 563 days ago
>The simplest benchmark merely changes things like "john had 4 apples" to "Mary had 4 oranges."

Those models (4o, o1-mini, o1-preview) don't see any drop at all on those benchmarks. The only benchmark that sees drops with the SOTA models is the one they add, "seemingly relevant but ultimately irrelevant information".

Humans can and do drop in performance when presented with such alterations. Are they better than LLMs in that case? Who knows? These papers don't bother testing human baselines.