| > an unsubstantiated claim about widespread misconduct. I can't prove it, but I heard it from multiple people in the industry. High contamination levels for existing benchmarks are documented, though [1,2]. Whether you believe that this is as good as we can do, a failure to do the best possible decontamination, or done on purpose is up to you. > Yes, validation and test sets serve different purposes - that's precisely why reputable labs maintain strict separations between them. The verbal agreement promised not to train on the evaluation set. Using it as a validation set would not violate this agreement. Clearly, OpenAI did not plan to use the provided evaluation as a test set, because then they wouldn't need access to it. Also, reporting validation numbers as a performance metric is not unheard of. > This reveals a fundamental misunderstanding of why math capabilities matter. They're not primarily about serving math users - they're a key metric for abstract reasoning and systematic problem-solving abilities. How good of a proxy is it? There is some correlation, but can you say something quantitative? Do you think you can predict which models perform better on math benchmarks based on interaction with them? Especially for a benchmark you have no access to and can't solve by yourself? If the answer is no, the number is more or less meaningless by itself, which means it would be very hard to catch somebody giving you incorrect numbers. > someone who has picked up some ML terminology but lacks fundamental understanding of how research evaluation, technical accountability, and institutional incentives actually work in the field My credentials are in my profile, not that I think they should matter. However, I do have experience specifically in deep learning research and evaluation of LLMs. [1] https://aclanthology.org/2024.naacl-long.482/
[2] https://arxiv.org/abs/2412.15194 |
The cited papers demonstrate that benchmark contamination exists as a general technical challenge, but they are being stretched to support a much stronger claim about intentional misconduct by a specific actor. This is a textbook example of extending evidence far beyond its scope.
> "The verbal agreement promised not to train on the evaluation set. Using it as a validation set would not violate this agreement."
This argument reveals a concerning misunderstanding of research ethics. Attempting to justify potential misconduct through semantic technicalities ("well, validation isn't technically training") suggests a framework where anything not explicitly forbidden is acceptable. This directly contradicts established principles of scientific integrity where the spirit of agreements matters as much as their letter.
> "How good of a proxy is it? [...] If the answer is no, the number is more or less meaningless by itself"
This represents a stark logical reversal. The initial argument assumed benchmark manipulation would be meaningful enough to influence investors and industry perception. Now, when challenged, the same metrics are suddenly "meaningless." This is fundamentally inconsistent: either the metrics matter (in which case manipulation would be serious misconduct) or they don't (in which case there's no incentive to manipulate them).
> "My credentials are in my profile, not that I think they should matter."
The simultaneous appeal to and dismissal of credentials mirrors the claims as a whole: at this point, the argument that OpenAI did something wrong rests on unfalsifiable claims about the industry as a whole and on claimed insider knowledge, while avoiding any verifiable evidence.
When challenged, it retreats to increasingly abstract hypotheticals about what "could" happen rather than what evidence shows did happen.
This demonstrates how seemingly technical arguments can fail basic principles of evidence and logic, while maintaining surface-level plausibility through domain-specific terminology. This kind of reasoning would not pass basic scrutiny in any rigorous research context.