
Here's a sample-size calculator that may help illustrate the issue: https://sample-size.net/sample-size-proportions/ Put in the benchmark score of one model as p₀ and of the other model as p₁ (as a fraction between 0 and 1) and observe what kind of sample size you need to reliably observe a statistically significant difference. The largest change between GPT 5.2 and 5.4 highlighted in https://openai.com/index/introducing-gpt-5-4/ is OSWorld-Verified going from 47.3% to 75.0%. That's quite the difference, right? So plug in 0.473 and 0.75 and note that the required sample size per model is 55. For the software engineering tasks in SWE-Bench Pro, the change from 55.6% to 57.7% is a whopping 2.1 percentage points, which you can detect with a mere 8836 samples per model.
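If you'd rather not trust a web form, here is a rough Python sketch of the same kind of calculation. It is not the calculator's actual code. I'm assuming a two-sided α of 0.05, 80% power, equal group sizes, and the usual normal-approximation formula with a continuity correction; the function name samples_per_group is made up for this example.

```python
from math import ceil, sqrt
from statistics import NormalDist


def samples_per_group(p0: float, p1: float,
                      alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group sample size for comparing two proportions.

    Assumptions (not taken from the calculator's source): two-sided test,
    equal group sizes, normal approximation with continuity correction.
    """
    if p0 == p1:
        raise ValueError("p0 and p1 must differ")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for two-sided alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power

    diff = abs(p1 - p0)
    p_bar = (p0 + p1) / 2                          # pooled proportion under the null

    # Uncorrected sample size from the normal approximation.
    n = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
         + z_beta * sqrt(p0 * (1 - p0) + p1 * (1 - p1))) ** 2 / diff ** 2

    # Continuity correction inflates n somewhat, more so for small n.
    n_cc = n / 4 * (1 + sqrt(1 + 4 / (n * diff))) ** 2
    return ceil(n_cc)


if __name__ == "__main__":
    print(samples_per_group(0.473, 0.75))   # OSWorld-Verified scores
    print(samples_per_group(0.556, 0.577))  # SWE-Bench Pro scores
```

With those assumed settings this should give the 55 and 8836 quoted above. The required n grows roughly with 1/(p₁ − p₀)², which is why a gap about 13 times smaller needs about 160 times as many samples.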

I'm sure someone in charge of benchmarking at OpenAI knows how statistics work and always makes sure to take a sufficiently large number of samples when comparing different models, but for most other people who want to know which model is better, the answer is unlikely to be worth the cost of measuring it precisely enough to find out.