> "I can't prove it, but I heard it from multiple people in the industry"
The cited papers demonstrate that benchmark contamination exists as a general technical challenge, but are being misappropriated to support a much stronger claim about intentional misconduct by a specific actor. This is a textbook example of expanding evidence far, far beyond its scope.
> "The verbal agreement promised not to train on the evaluation set. Using it as a validation set would not violate this agreement."
This argument reveals a concerning misunderstanding of research ethics. Attempting to justify potential misconduct through semantic technicalities ("well, validation isn't technically training") suggests a framework where anything not explicitly forbidden is acceptable. This directly contradicts established principles of scientific integrity where the spirit of agreements matters as much as their letter.
> "How good of a proxy is it? [...] If the answer is no, the number is more or less meaningless by itself"
This represents a stark logical reversal. The initial argument assumed benchmark manipulation would be meaningful enough to influence investors and industry perception. Now, when challenged, the same metrics are suddenly "meaningless." This is fundamentally inconsistent - either the metrics matter (in which case manipulation would be serious misconduct) or they don't (in which case there's no incentive to manipulate them).
> "My credentials are in my profile, not that I think they should matter."
The attempted simultaneous appeal to and dismissal of credentials is an interesting mirror of the claims as a whole: at this point, the argument that OpenAI did something rests on unfalsifiable claims about the industry as a whole, claiming insider knowledge, while avoiding any verifiable evidence. When challenged, it retreats to increasingly abstract hypotheticals about what "could" happen rather than what evidence shows did happen.
This demonstrates how seemingly technical arguments can fail basic principles of evidence and logic, while maintaining surface-level plausibility through domain-specific terminology. This kind of reasoning would not pass basic scrutiny in any rigorous research context.
Validation is not training, period. I'll ask again: what is the possible goal of accessing the evaluation set if you don't plan to use it for anything except the final evaluation, which is what the test set is used for? Either they just asked for access without any intent to use the provided data in any way except for final evaluation, which can be done without access, or they did somehow utilize the provided data, whether by training on it (which they verbally promised not to), using it as a validation set, using it to create a similar training set, or something else.
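To make the validation point concrete, here is a minimal toy simulation. It is a sketch under assumptions of my own (the function names, the number of variants, and the set sizes are all hypothetical) and it says nothing about what any particular lab actually did. Every "model variant" here guesses at random, so its true accuracy is 0.5, yet picking the best variant by its score on a small evaluation set still yields a number well above 0.5, without any training on that set.

```python
import random


def guess_accuracy(rng, n_items):
    """Accuracy of a model that guesses at random on a binary task."""
    correct = sum(rng.random() < 0.5 for _ in range(n_items))
    return correct / n_items


def simulate(n_variants=200, eval_size=100, fresh_size=10_000, seed=0):
    """Hypothetical setup: all variants have true accuracy 0.5."""
    rng = random.Random(seed)
    # Score every skill-less variant on the small evaluation set and keep
    # only the best number, as one would when using the set for validation.
    eval_scores = [guess_accuracy(rng, eval_size) for _ in range(n_variants)]
    best_eval = max(eval_scores)
    # The selected variant has no real skill, so on unseen data it is
    # still a coin flip.
    fresh = guess_accuracy(rng, fresh_size)
    return best_eval, fresh


best_eval, fresh = simulate()
print(f"selected on eval set: {best_eval:.2f}, on fresh data: {fresh:.2f}")
```

The gap between the two printed numbers comes purely from selection, which is why a promise not to train on an evaluation set does not by itself keep the reported score clean.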
> This directly contradicts established principles of scientific integrity where the spirit of agreements matters as much as their letter.
OpenAI is not doing science; they are doing business.
> This represents a stark logical reversal. The initial argument assumed benchmark manipulation would be meaningful enough to influence investors and industry perception. Now, when challenged, the same metrics are suddenly "meaningless." This is fundamentally inconsistent - either the metrics matter (in which case manipulation would be serious misconduct) or they don't (in which case there's no incentive to manipulate them).
The metrics matter to people, but this doesn't mean people can meaningfully predict the model's performance using them. If I were trying to label each of your arguments as some demagogic technique (you're probably going to call that ad hominem or something), then I'd say this one is a false dichotomy: people can care about metrics even when those metrics cannot predict performance precisely enough to be useful.
> The attempted simultaneous appeal to and dismissal of credentials
I'm not appealing to credentials. Based on what I wrote, you made a wrong guess about my credentials, and I pointed out your mistake.
> at this point, the argument OpenAI did something rests on unfalsifiable claims about the industry as a whole, claiming insider knowledge, while avoiding any verifiable evidence.
Your position, on the other hand, rests on the assumption that corporations behave ethically and with integrity beyond what is required by the law (and, specifically, their contracts with other entities).