by menaerus
520 days ago
Perhaps what he meant is that the public will be able to benchmark the model themselves by throwing math problems of varying difficulty at it, not necessarily the FrontierMath benchmark. It should become pretty obvious whether they were faking the results.
There's absolutely no comeuppance for juicing benchmarks, especially ones no one has access to. If o3's performance doesn't meet expectations, there'll be plenty of people making excuses for it ("You're prompting it wrong!", "That's just not its domain!").
[0] https://openreview.net/forum?id=YXnwlZe0yf&noteId=yrsGpHd0Sf