by Majromax
54 days ago
> Similarly, it's unlikely you can measure a significant performance difference between models like GPT 5.4-xhigh and GPT 5.2 unless you have a task where one of them almost always fails or one almost always succeeds.

That feels like a concession to the limited benchmarking framework. 5.4-xhigh is supposed to be (and is widely believed to be) a better model than 5.2, so if that's invisible in the benchmark scores then the protocol has problems. The test should probably include cases that are expected to be 'easy passes' or 'near-always failures', and then paired testing could offer greater precision on improvements or degradations.

Conversely, if model providers also don't do this, then they could be accidentally 'benchmaxxing' if they use protocols like this to set dynamic quantization levels for inference. All you really need for a credible observation of problems from 'less intensive use' is a problem domain that isn't well covered by the measured and monitored benchmark.
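To make the paired-testing point concrete, here is a minimal sketch comparing an unpaired two-proportion z-test with an exact McNemar test on the same pass/fail results. The task counts are made up for illustration (100 hypothetical tasks, most of which both models pass or both fail), and the function names are my own, not part of any benchmark harness.

```python
import math

def unpaired_p(old, new):
    """Two-sided p-value from a two-proportion z-test that ignores task pairing."""
    n_old, n_new = len(old), len(new)
    pooled = (sum(old) + sum(new)) / (n_old + n_new)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_old + 1 / n_new))
    if se == 0:
        return 1.0
    z = (sum(new) / n_new - sum(old) / n_old) / se
    return math.erfc(abs(z) / math.sqrt(2))

def mcnemar_exact_p(old, new):
    """Two-sided exact McNemar p-value, using only tasks where the models disagree."""
    new_only = sum(1 for o, n in zip(old, new) if n and not o)
    old_only = sum(1 for o, n in zip(old, new) if o and not n)
    discordant = new_only + old_only
    if discordant == 0:
        return 1.0
    k = min(new_only, old_only)
    tail = sum(math.comb(discordant, i) for i in range(k + 1)) / 2 ** discordant
    return min(1.0, 2 * tail)

# Hypothetical results on the same 100 tasks (1 = pass, 0 = fail):
# 40 both pass, 40 both fail, 12 only the new model passes, 2 only the old one does.
old = [1] * 40 + [0] * 40 + [0] * 12 + [1] * 2
new = [1] * 40 + [0] * 40 + [1] * 12 + [0] * 2

p_unpaired = unpaired_p(old, new)
p_paired = mcnemar_exact_p(old, new)
print(f"unpaired p = {p_unpaired:.3f}, paired p = {p_paired:.3f}")
```

With these invented numbers the pass rates are 42% versus 52%. The unpaired test does not reach the usual 0.05 threshold, while the paired test does, because the 80 tasks where both models agree add noise to the unpaired comparison but drop out of the paired one.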
|
I'm sure someone in charge of benchmarking at OpenAI knows how statistics work and always makes sure to take a sufficiently large number of samples when comparing different models, but for most other people who want to know which model is better, the answer is unlikely to be worth the cost of measuring it precisely enough to find out.
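As a rough illustration of the cost involved, here is a sketch of the textbook sample-size formula for an unpaired comparison of two pass rates. The 60% and 63% pass rates are hypothetical, chosen only to show the order of magnitude, and `samples_per_model` is an illustrative name.

```python
import math
from statistics import NormalDist

def samples_per_model(p1, p2, alpha=0.05, power=0.8):
    """Approximate runs needed per model to detect p1 vs p2 with a two-sided z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

# Hypothetical pass rates: a 3-point gap versus a 20-point gap.
small_gap = samples_per_model(0.60, 0.63)
large_gap = samples_per_model(0.60, 0.80)
print(small_gap, large_gap)
```

Under these assumptions a 3-point gap needs thousands of runs per model, while a 20-point gap needs fewer than a hundred, which is the sense in which small differences are expensive to confirm.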