|
|
|
|
|
by yorwba
55 days ago
|
|
Yes, MarginLab only tests 50 tasks a day, which is too few to give a narrower confidence interval. On the other hand, this really calls into question claims of performance degradation that are based on less intensive use than that. Variance is just so high that long streaks of bad luck are to be expected and plausibly the main source of such complaints. Similarly, it's unlikely you can measure a significant performance difference between models like GPT 5.4-xhigh and GPT 5.2 unless you have a task where one of them almost always fails or one almost always succeeds (thus guaranteeing low variance), or you make a lot of calls (i.e. probably through the API and not in interactive mode.) |
|
That feels like a concession to the limited benchmarking framework. 5.4-xhigh is supposed to be (and is widely believe to be) a better model than 5.2, so if that's invisible in the benchmarking scores then the protocol has problems. The test probably should include cases that should be 'easy passes' or 'near always failures', and then paired testing could offer greater precision on improvements or degradations.
Conversely, if model providers also don't do this then they could be accidentally 'benchmaxxing' if they use protocols like this to set dynamic quantization levels for inference. All you really need for a credible observation of problems from 'less intensive use' is a problem domain that isn't well-covered by the measured and monitored benchmark.