| HN Mirror


by refulgentis 511 days ago

It's increasingly odd to see HN activity that assumes the premise: if the latest benchmark results involve a benchmark that can be shown to contain any data OpenAI could have accessed, then the benchmark results were intentionally faked.

Last time, this confused a bunch of people who didn't understand what test vs. train data meant, and it resulted in a particular luminary complaining on Twitter, to many guffaws, about how troubling the situation was.

Literally every comment currently, modulo [1], assumes this and then goes several steps further, and a majority are wildly misusing terms with precise meanings, which explains at least part of their confusion.

[1] modulo the one saying this is irrelevant because we'll know if it's bad when it comes out. To be fair, evaluated rationally, that doesn't narrowly help with the suspicion that the FrontierMath results are all invalid because the model trained on (most of) the solutions.

1 comments

EvgeniyZh 510 days ago

Why wouldn't OpenAI cheat? It's an open secret in industry that benchmarks are trained on. Everybody does it, so you need to do that or else your similarly performing model will look worse on paper.

And even if they respect the agreement, even using the test set as a validation set can be a huge advantage. That's why validation set and test set are two different terms with precise meanings.

As for "knowing it's bad", most people won't be able to tell a model scoring 25% and one scoring 10% apart. People who are using these models to solve math problems are a tiny share of users and an even tinier share of revenue. What OpenAI needs is to convince investors that capabilities are still progressing at a high pace, and gaming the benchmarks makes perfect sense in this context. 25% was surprising and appeared to surpass expectations, which is exactly what OpenAI needs.


refulgentis 510 days ago

> Why wouldn't OpenAI cheat? It's an open secret in industry that benchmarks are trained on. Everybody does it, so you need to do that or else your similarly performing model will look worse on paper.

This starts with a fallacious appeal to cynicism combined with an unsubstantiated claim about widespread misconduct. The "everybody does it" argument is a classic rationalization that doesn't actually justify anything. It also misunderstands the reputational and technical stakes - major labs face intense scrutiny of their methods and results, and there's plenty of incestuous movement between labs and plenty of leaks.

> And even if they respect the agreement, even using the test set as a validation set can be a huge advantage. That's why validation set and test set are two different terms with precise meanings.

This part accidentally stumbles into a valid point about ML methodology while completely missing why it matters. Yes, validation and test sets serve different purposes - that's precisely why reputable labs maintain strict separations between them. The implication that this basic principle somehow proves misconduct is backwards logic.

> People who are using these models to solve math problems are a tiny share of users and an even tinier share of revenue.

This reveals a fundamental misunderstanding of why math capabilities matter. They're not primarily about serving math users - they're a key metric for abstract reasoning and systematic problem-solving abilities. This is basic ML evaluation theory.

> What OpenAI needs is to convince investors that capabilities are still progressing at a high pace, and gaming the benchmarks makes perfect sense in this context. 25% was surprising and appeared to surpass expectations, which is exactly what OpenAI needs.

This concludes with pure speculation presented as fact, combined with a conspiracy theory that lacks any actual evidence. It also displays a shallow understanding of how technical due diligence works in major AI investments - investors at this level typically have deep technical expertise, access to extensive testing and validation, and most damningly, given the reductive appeal to incentive structure:

They closed the big round weeks before.

The whole comment reads like someone who has picked up some ML terminology but lacks fundamental understanding of how research evaluation, technical accountability, and institutional incentives actually work in the field. The dismissive tone and casual accusations of misconduct don't help their credibility either.


BeefWellington 510 days ago

> The "everybody does it" argument is a classic rationalization that doesn't actually justify anything.

I'd argue here the more relevant point is "these specific people have been shown to have done it before."

> The whole comment reads like someone who has picked up some ML terminology but lacks fundamental understanding of how research evaluation, technical accountability, and institutional incentives actually work in the field. The dismissive tone and casual accusations of misconduct don't help their credibility either.

I think what you're missing is the observation that so very little of that is actually applied in this case. "AI" here is not being treated as an actual science would be. The majority of the papers pumped out of these places are not real concrete research, not submitted to journals, and not peer reviewed works.


refulgentis 510 days ago

> I'd argue here the more relevant point is "these specific people have been shown to have done it before."

This is itself a slippery move. A vague gesture at past misconduct without actually specifying any incidents. If there's a clear pattern of documented benchmark manipulation, name it. Which benchmarks? When? What was the evidence? Without specifics, this is just trading one form of handwaving ("everyone does it") for another ("they did it before").

> "AI" here is not being treated as an actual science would be.

There's some truth here but also some sleight of hand. Yes, AI development often moves outside traditional academic channels. But, you imply this automatically means less rigor, which doesn't follow. Many industry labs have internal review processes, replication requirements, and validation procedures that can be as or more stringent than academic peer review. The fact that something isn't in Nature doesn't automatically make it less rigorous.

> The majority of the papers pumped out of these places are not real concrete research, not submitted to journals, and not peer reviewed works.

This combines three questionable implications:

- That non-journal publications are automatically "not real concrete research" (tell that to physics/math arXiv)

- That peer review is binary - either traditional journal review or nothing (ignoring internal review processes, community peer review, public replications)

- That volume ("pumped out") correlates with quality

You're making a valid critique of AI's departure from traditional academic structures, but then making an unjustified leap to assuming this means no rigor at all. It's like saying because a restaurant isn't Michelin-starred, it must have no food safety standards.

This also ignores the massive reputational and financial stakes that create strong incentives for internal rigor. Major labs have to maintain credibility with:

- Their own employees.

- Other researchers who will try to replicate results.

- Partners integrating their technology.

- Investors doing technical due diligence.

- Regulators scrutinizing their claims.

The idea that they would casually risk all that just to bump up one benchmark number (but not too much! just from 10% to 25%) doesn't align with the actual incentive structure these organizations face.

Both the original comment and this fall into the same trap - mistaking cynicism for sophistication while actually displaying a somewhat superficial understanding of how modern AI research and development actually operates.


BeefWellington 510 days ago

This reply reads as though it were AI generated.

Let's bite though, and hope that unhelpful excessively long-winded replies are just your quirk.

> This is itself a slippery move. A vague gesture at past misconduct without actually specifying any incidents. If there's a clear pattern of documented benchmark manipulation, name it. Which benchmarks? When? What was the evidence? Without specifics, this is just trading one form of handwaving ("everyone does it") for another ("they did it before").

Ok, provide specifics yourself then. Someone replied and pointed out that they have every incentive to cheat, and your response was:

> This starts with a fallacious appeal to cynicism combined with an unsubstantiated claim about widespread misconduct. The "everybody does it" argument is a classic rationalization that doesn't actually justify anything. It also misunderstands the reputational and technical stakes - major labs face intense scrutiny of their methods and results, and there's plenty of incestuous movement between labs and plenty of leaks.

Respond to the content of the argument -- be specific. WHY is OpenAI not incentivized to cheat on this benchmark? Why is a once-nonprofit, which turned from releasing open and transparent models to a closed model and began raking in tens of billions in investor cash, not incentivized to keep those investors happy? Be specific. Because there's a clear pattern of corporate behaviour at OpenAI and associated entities which suggests your take is not, in fact, the simpler viewpoint.

> This combines three questionable implications:

> - That non-journal publications are automatically "not real concrete research" (tell that to physics/math arXiv)

Yes, arXiv will host lots of stuff that isn't real concrete research. They've hosted April Fool's jokes, for example.[1]

> - That peer review is binary - either traditional journal review or nothing (ignoring internal review processes, community peer review, public replications)

This is a poor/incorrect reading of the language. You have inferred meaning that does not exist. If citations are so important here, cite a few dozen that are peer reviewed out of the hundreds.

> - That volume ("pumped out") correlates with quality

Incorrect reading again. Volume here correlates with marketing and hype. It could have an effect on quality but that wasn't the purpose behind the language.

> You're making a valid critique of AI's departure from traditional academic structures, but then making an unjustified leap to assuming this means no rigor at all. It's like saying because a restaurant isn't Michelin-starred, it must have no food safety standards.

Why is that unjustified? It's no different than any of the science background people who have fallen into flat earther beliefs. They may understand the methods but if they are not tested with rigor and have abandoned scientific principles they do not get to keep pretending it's as valid as actual science.

> This also ignores the massive reputational and financial stakes that create strong incentives for internal rigor. Major labs have to maintain credibility with:

FWIW, this regurgitated talking point is what makes me believe this is an LLM-generated reply. OpenAI is not a major research lab. They appear essentially to be trading off the names of more respected institutions and the mathematicians who came up with FrontierMath. The credibility damage here can be done by a single person sharing data with OpenAI, unbeknownst to individual participants.

Separately, even under correct conditions it's not as if there are not all manner of problems in science in terms of ethical review. See for example, [2].

[1] https://arxiv.org/abs/2003.13879 - FWIW, I'm not against scientists having fun, but it should be understood that arXiv is basically three steps above HN or reddit.

[2] https://lore.kernel.org/linux-nfs/YH+zwQgBBGUJdiVK@unreal/ + related HN discussion: https://news.ycombinator.com/item?id=26887670


refulgentis 510 days ago

First paragraph is unnecessarily personal.

It's also confusing: Did you think it was AI because of the "regurgitated talking point", as you say later, or because it was an "unhelpful excessively long-winded repl[y]"?

I'll take the whole thing as an intemperate moment, and what was intended to be communicated was "I'd love to argue about this more, but can you cut down reply length?"

> Ok, provide specifics yourself then.

Pointing out "Everyone does $X" is fallacious does not imply you have to prove no one has any incentive to do $X. There's plenty of things you have an incentive to do that I trust you won't do. :)

> If citations are so important here, cite a few dozen that are peer reviewed out of the hundreds.

Sure.

I got lost a bit, though, of what?

Are you asking for a set of journal articles, that are peer-reviewed, about AI, that aren't on arxiv?

> Why is that unjustified?

"$X doesn't follow traditional academic structures" does not imply "$X has no rigor at all"

> OpenAI is not a major research lab.

Eep.

> "all manner of problems in science in terms of ethical review. "

Yup!

The last 2 on my part are short because I'm not sure how to reply to "entity $A has short-term incentive to do thing $X, and entity $A is part of large group $B that sometimes does thing $X". We don't disagree there! I'm just applying symbolic logic to the rest. E.g., when I say "$X does not imply $Y", that has a very definite field-specific meaning.

It's fine to feel the way you do. It takes a rigorously rational process to end up making my argument, but rigorously is too kind: it would be crippling in daily life.

A clear warning sign, for me, setting aside the personal attack opening, would have been when I was doing things like "arXiv has April Fool's Jokes!" -- I like to think I would have taken a step back after noticing it was "OpenAI is distantly related to group $X, a member of group $X did $Y, therefore let's assume OpenAI did $Y and conversate from there"


EvgeniyZh 510 days ago

> an unsubstantiated claim about widespread misconduct.

I can't prove it, but I heard it from multiple people in the industry. High contamination levels for existing benchmarks are documented, though [1,2]. Whether to believe that this is just as good as we can do, a failure to do the best possible decontamination, or something done on purpose is up to you.
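For readers unfamiliar with what a contamination check looks like, here is a minimal sketch of one widely described heuristic: measuring word n-gram overlap between a benchmark item and training text. It is an illustration only, not the method of the papers cited in [1,2]; the function names, the whitespace tokenization, and the default `n=5` are all assumptions made for this example.

```python
def word_ngrams(text, n):
    """Return the set of lowercase word n-grams in text (hypothetical helper)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(benchmark_item, training_docs, n=5):
    """Fraction of the item's n-grams that also appear in any training document.

    Assumption: plain whitespace tokenization and exact matching, which
    misses paraphrases and translated or reformatted copies.
    """
    item_grams = word_ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n words: nothing to compare
    corpus_grams = set()
    for doc in training_docs:
        corpus_grams |= word_ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


if __name__ == "__main__":
    item = "find the number of integer solutions to x squared plus y squared equals 25"
    docs = ["forum post: find the number of integer solutions to x squared plus y squared equals 25 thanks"]
    print(overlap_fraction(item, docs))
```

A high fraction flags an item for review; a low one does not prove the item is clean, since exact matching is easy to evade, intentionally or not.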

> Yes, validation and test sets serve different purposes - that's precisely why reputable labs maintain strict separations between them.

The verbal agreement promised not to train on the evaluation set. Using it as a validation set would not violate this agreement. Clearly, OpenAI did not plan to use the provided evaluation set as a test set, because then they wouldn't need access to it. Also, reporting validation numbers as a performance metric is not unheard of.

> This reveals a fundamental misunderstanding of why math capabilities matter. They're not primarily about serving math users - they're a key metric for abstract reasoning and systematic problem-solving abilities.

How good of a proxy is it? There is some correlation, but can you say something quantitative? Do you think you can predict which models perform better on math benchmarks based on interaction with them? Especially for a benchmark you have no access to and can't solve by yourself? If the answer is no, the number is more or less meaningless by itself, which means it would be very hard to catch somebody giving you incorrect numbers.

> someone who has picked up some ML terminology but lacks fundamental understanding of how research evaluation, technical accountability, and institutional incentives actually work in the field

My credentials are in my profile, not that I think they should matter. However, I do have experience specifically in deep learning research and evaluation of LLMs.

[1] https://aclanthology.org/2024.naacl-long.482/

[2] https://arxiv.org/abs/2412.15194


refulgentis 510 days ago

> "I can't prove it, but I heard it from multiple people in the industry"

The cited papers demonstrate that benchmark contamination exists as a general technical challenge, but are being misappropriated to support a much stronger claim about intentional misconduct by a specific actor. This is a textbook example of stretching evidence far, far beyond its scope.

> "The verbal agreement promised not to train on the evaluation set. Using it as a validation set would not violate this agreement."

This argument reveals a concerning misunderstanding of research ethics. Attempting to justify potential misconduct through semantic technicalities ("well, validation isn't technically training") suggests a framework where anything not explicitly forbidden is acceptable. This directly contradicts established principles of scientific integrity where the spirit of agreements matters as much as their letter.

> "How good of a proxy is it? [...] If the answer is no, the number is more or less meaningless by itself"

This represents a stark logical reversal. The initial argument assumed benchmark manipulation would be meaningful enough to influence investors and industry perception. Now, when challenged, the same metrics are suddenly "meaningless." This is fundamentally inconsistent - either the metrics matter (in which case manipulation would be serious misconduct) or they don't (in which case there's no incentive to manipulate them).

> "My credentials are in my profile, not that I think they should matter."

The attempted simultaneous appeal to and dismissal of credentials is an interesting mirror of the claims as a whole: at this point, the argument OpenAI did something rests on unfalsifiable claims about the industry as a whole, claiming insider knowledge, while avoiding any verifiable evidence.

When challenged, it retreats to increasingly abstract hypotheticals about what "could" happen rather than what evidence shows did happen.

This demonstrates how seemingly technical arguments can fail basic principles of evidence and logic, while maintaining surface-level plausibility through domain-specific terminology. This kind of reasoning would not pass basic scrutiny in any rigorous research context.


EvgeniyZh 510 days ago

> Attempting to justify potential misconduct through semantic technicalities ("well, validation isn't technically training")

Validation is not training, period. I'll ask again: what is the possible goal of accessing the evaluation set if you don't plan to use it for anything except the final evaluation, which is what the test set is used for? Either they just asked for access without any intent to use the provided data in any way except for final evaluation, which can be done without access, or they did somehow utilize the provided data, whether by training on it (which they verbally promised not to), using it as a validation set, using it to create a similar training set, or something else.

> This directly contradicts established principles of scientific integrity where the spirit of agreements matters as much as their letter.

OpenAI is not doing science; they are doing business.

> This represents a stark logical reversal. The initial argument assumed benchmark manipulation would be meaningful enough to influence investors and industry perception. Now, when challenged, the same metrics are suddenly "meaningless." This is fundamentally inconsistent - either the metrics matter (in which case manipulation would be serious misconduct) or they don't (in which case there's no incentive to manipulate them).

The metrics matter to people, but this doesn't mean people can meaningfully predict the model's performance using them. If I were trying to describe each of your arguments as some demagogue technique (you're going to call it ad hominem or something, probably), then I'd say it's a false dichotomy: it can, in fact, be impossible to use metrics to predict performance precisely enough and for people to care about metrics simultaneously.

> The attempted simultaneous appeal to and dismissal of credentials

I'm not appealing to credentials. Based on what I wrote, you made a wrong guess about my credentials, and I pointed out your mistake.

> at this point, the argument OpenAI did something rests on unfalsifiable claims about the industry as a whole, claiming insider knowledge, while avoiding any verifiable evidence.

Your position, on the other hand, rests on the assumption that corporations behave ethically and with integrity beyond what is required by the law (and, specifically, their contracts with other entities).


achierius 506 days ago

> Validation is not training, period.

Sure, but what we care about isn't the semantics of the words, it's the effects of what they're doing. Iterated validation plus humans doing hyperparameter tuning will go a long way towards making a model fit the data, even if you never technically run backprop with the validation set as input.
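A toy simulation can make the selection effect concrete. This is a sketch under strong assumptions, not a model of any lab's process: every hypothetical "configuration" has the same true accuracy (10%), so picking the best one by its score on a reused held-out set selects on noise alone. All names and default values here are made up for illustration.

```python
import random


def simulate_selection(num_configs=200, num_problems=100, true_acc=0.10, seed=0):
    """Return (best held-out score, mean score on fresh problem sets)."""
    rng = random.Random(seed)

    def noisy_score():
        # Fraction solved when each problem is solved with probability true_acc.
        solved = sum(rng.random() < true_acc for _ in range(num_problems))
        return solved / num_problems

    # Every hypothetical configuration has the same true accuracy,
    # so differences between held-out scores are pure sampling noise.
    held_out_scores = [noisy_score() for _ in range(num_configs)]
    best_held_out = max(held_out_scores)

    # Scores on fresh, unseen problem sets, averaged to reduce noise.
    fresh_scores = [noisy_score() for _ in range(num_configs)]
    mean_fresh = sum(fresh_scores) / len(fresh_scores)
    return best_held_out, mean_fresh


if __name__ == "__main__":
    best, fresh = simulate_selection()
    print(f"best score on reused held-out set: {best:.2f}")
    print(f"mean score on fresh problems:      {fresh:.2f}")
```

The best reused-set score lands well above the true 10% even though no gradient step ever touched those problems, which is why a number reported on a set used for selection is not comparable to a number from an untouched test set.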

> OpenAI is not doing science; they are doing business.

Are you implying these are orthogonal? OpenAI is a business centered on an ML research lab, which does research, and which people in the research community have generally come to respect.

> at this point, the argument OpenAI did something rests on unfalsifiable claims about the industry as a whole, claiming insider knowledge, while avoiding any verifiable evidence.

No, it doesn't. What OP is doing is critiquing OpenAI for their misbehavior. This is one of the few levers we (who do not have ownership or a seat on their board) have to actually influence their future decision-making -- well-reasoned critiques can convince people here (including some people who decide whether their company uses ChatGPT vs. Gemini vs. Claude vs. ...) that ChatGPT is not as good as benchmarks might claim, which in effect makes it more expensive for OpenAI to condone this kind of misbehavior going forward.

The argument that "no companies are moral, so critiquing them is pointless" is just an indirect way of running cover for those same immoral companies.
