| Do people actually think OpenAI is gaming benchmarks? I know they have lost trust and credibility, especially on HN. But this is a company with a giant revenue opportunity to sell products that work. What works for enterprise is very different from “does it beat this benchmark”. No matter how nefarious you think sama is, everything points to “build intelligence as rapidly as possible” rather than “spin our wheels messing with benchmarks”. In fact, even if they did fully lie and game the benchmark - do you even care? As an OpenAI customer, all I care about is that the product works. I code with o1 for hours every day, so I am very excited for o3 to be released via API. And if they trained on private datasets, I honestly don’t care. I just want to get a better coding partner until I’m irrelevant. Final thought - why are these contractors owed a right to know where funding came from? I would definitely be proud to know I contributed to the advancement of the field of AI if I was included in this group. |
Many people compare models based on benchmarks. So if OpenAI can appear better than Anthropic, Google, or Meta by gaming benchmarks, it's absolutely in their interest to do so, especially if their product is only slightly behind, because evaluating model quality is a very tricky business these days.
In particular, if there is a new benchmark, it's doubly in their interest to game it, because they know that other providers will start using and optimizing performance towards that benchmark, in order to "beat" OpenAI and win market share.
On a personal level, their model is getting beaten handily by Claude Sonnet 3.5 right now. It doesn't seem to show in the benchmarks. I wonder why?
This is a company that is shedding its coats of ethics and scientific rigor so as to be as unencumbered as possible in its footrace to the dollar.