	by yorwba 55 days ago
	Yes, MarginLab only tests 50 tasks a day, which is too few to give a narrower confidence interval. On the other hand, this really calls into question claims of performance degradation that are based on less intensive use than that. Variance is just so high that long streaks of bad luck are to be expected and plausibly the main source of such complaints. Similarly, it's unlikely you can measure a significant performance difference between models like GPT 5.4-xhigh and GPT 5.2 unless you have a task where one of them almost always fails or one almost always succeeds (thus guaranteeing low variance), or you make a lot of calls (i.e. probably through the API and not in interactive mode.)
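To put a rough number on "too few": a minimal sketch, assuming each task is an independent pass/fail trial and using a Wilson score interval (one common choice, not necessarily what MarginLab reports). The 30-out-of-50 figure is a made-up example, not a MarginLab result.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, trials, confidence=0.95):
    """Wilson score interval for a binomial pass rate.

    Assumes independent pass/fail trials with a constant success rate.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / trials
    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = (
        z * sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials))
    ) / denom
    return center - half_width, center + half_width


# Hypothetical day: 30 passes out of 50 tasks.
low, high = wilson_interval(30, 50)
print(f"95% interval for the pass rate: {low:.2f} to {high:.2f}")
```

With 50 tasks the interval spans roughly 0.46 to 0.72, about 26 percentage points, so a model that passes 60% of tasks on average can look much better or much worse on any single day.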

1 comments

Majromax 54 days ago

> Similarly, it's unlikely you can measure a significant performance difference between models like GPT 5.4-xhigh and GPT 5.2 unless you have a task where one of them almost always fails or one almost always succeeds

That feels like a concession to the limited benchmarking framework. 5.4-xhigh is supposed to be (and is widely believed to be) a better model than 5.2, so if that's invisible in the benchmarking scores then the protocol has problems. The test probably should include cases that should be 'easy passes' or 'near always failures', and then paired testing could offer greater precision on improvements or degradations.
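As one illustration of what paired testing could look like: a minimal sketch of an exact McNemar test, assuming both models are run on the same tasks and each task is scored pass/fail. Only the tasks where the two models disagree carry information, which is where the extra precision comes from. The function name and the counts below are hypothetical, and this is one possible paired test rather than a protocol any benchmark is known to use.

```python
from math import comb


def mcnemar_exact_p(only_a_passes, only_b_passes):
    """Two-sided exact McNemar p-value from the discordant task counts.

    only_a_passes: tasks model A passed and model B failed.
    only_b_passes: tasks model B passed and model A failed.
    Tasks where both pass or both fail do not enter the test.
    """
    n = only_a_passes + only_b_passes
    if n == 0:
        return 1.0  # no disagreements, no evidence either way
    k = min(only_a_passes, only_b_passes)
    # Under the null hypothesis each disagreement is a fair coin flip.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: A wins 12 of the disagreements, B wins 3.
print(mcnemar_exact_p(12, 3))
```

In this made-up case, 15 disagreements split 12 to 3 give a p-value of about 0.035, regardless of how many tasks both models passed or both failed.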

Conversely, if model providers also don't do this then they could be accidentally 'benchmaxxing' if they use protocols like this to set dynamic quantization levels for inference. All you really need for a credible observation of problems from 'less intensive use' is a problem domain that isn't well-covered by the measured and monitored benchmark.

yorwba 54 days ago

Here's a sample-size calculator that may help illustrate the issue: https://sample-size.net/sample-size-proportions/ Put in the benchmark score of one model as p₀ and of the other model as p₁ (as a fraction between 0 and 1) and observe what kind of sample size you need to reliably observe a significant difference. The largest change between GPT 5.2 and 5.4 highlighted in https://openai.com/index/introducing-gpt-5-4/ is OSWorld-Verified going from 47.3% to 75.0%. That's quite the difference, right? So plug in 0.473 and 0.75 and note that the required sample size per model is 55. For the software engineering tasks in SWE-Bench Pro, the change from 55.6% to 57.7% is a whopping 2.1 percentage points, which you can detect with a mere 8836 samples.
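For anyone who would rather script this than use the web form, here is a minimal sketch of the calculation. It assumes a two-sided test at α = 0.05 with 80% power, equal group sizes, and the normal approximation with a continuity correction. Those settings are my assumption about what produces the figures above, not something the calculator's page is quoted as saying here, and the function name is made up.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_model(p0, p1, alpha=0.05, power=0.80):
    """Samples needed per model to detect a difference between p0 and p1.

    Normal-approximation formula for comparing two independent
    proportions, followed by a continuity correction.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p0 + p1) / 2
    delta = abs(p1 - p0)
    # Uncorrected sample size per group.
    n = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p0 * (1 - p0) + p1 * (1 - p1))
    ) ** 2 / delta ** 2
    # Continuity correction inflates n, most noticeably for small n.
    n_corrected = n / 4 * (1 + sqrt(1 + 4 / (n * delta))) ** 2
    return ceil(n_corrected)


print(sample_size_per_model(0.473, 0.75))   # OSWorld-Verified scores
print(sample_size_per_model(0.556, 0.577))  # SWE-Bench Pro scores
```

Under these assumptions the two calls give 55 and 8836, matching the numbers above. The required sample size grows roughly with the inverse square of the gap, which is why a gap about 13 times smaller needs well over a hundred times as many samples.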

I'm sure someone in charge of benchmarking at OpenAI knows how statistics work and always makes sure to take a sufficiently large number of samples when comparing different models, but for most other people who want to know which model is better, the answer is unlikely to be worth the cost of measuring it precisely enough to find out.