by SlinkyOnStairs 63 days ago
> hopefully changes the way benchmarking is done

The purpose of a system is what it does.

AI companies want adcopy, not legitimate benchmarks. Even this very paper will be twisted into a means to that end. "Oooo, AI is exploiting our benchmarks. Scary alignment problem!!!one! Our AI is so good we can't contain it, INVEST NOW!"


I work at OpenAI and I really don't find this to be the case.

We're pretty diligent about applying search blocklists, closing hacking loopholes, and reading model outputs to catch unanticipated hacks. If we wanted to, we could choose to close our eyes and plug our ears and report higher scores for Terminal-bench, SWE-bench, etc. that technically comply with the reference implementation but aren't aligned with real value delivered to users, but we don't do this. My impression is that Anthropic and other labs are similar. E.g., in the Sonnet 4.6 system card they use a model to detect potential contamination and manually score those outputs as 0 if human review agrees there was contamination. If all the labs cared about was marketing material, it would be quite easy not to do this extra work.
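To make the contamination rule above concrete, here is a minimal sketch of "score as 0 only if human review agrees." It is not any lab's actual pipeline: the `Result` fields, the function names, and the sample data are all hypothetical, chosen only to illustrate the two-step check.

```python
from dataclasses import dataclass


@dataclass
class Result:
    task_id: str
    raw_score: float          # score from the reference grader
    flagged_by_model: bool    # a detector model suspects contamination
    confirmed_by_human: bool  # a human reviewer agrees with the flag


def adjusted_score(r: Result) -> float:
    """Zero out a result only when the detector and a human both agree."""
    if r.flagged_by_model and r.confirmed_by_human:
        return 0.0
    return r.raw_score


def benchmark_mean(results: list[Result]) -> float:
    """Average of the adjusted per-task scores."""
    return sum(adjusted_score(r) for r in results) / len(results)


# Hypothetical data: task "b" is confirmed contaminated, "c" is a false alarm.
results = [
    Result("a", 1.0, False, False),
    Result("b", 1.0, True, True),
    Result("c", 1.0, True, False),
    Result("d", 0.0, False, False),
]
print(benchmark_mean(results))  # 0.5, versus 0.75 if flags were ignored
```

The point of requiring both signals is that the extra review can only lower the reported number, which is why skipping it would be the easy path if marketing were the only goal.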

There are a ton of other games you can play with evals too (e.g., test 100 different model checkpoints or run secret prompt optimization to steer away from failing behaviors), but by and large what I've seen inside OpenAI is trustworthy.

I won't say everything is 100% guaranteed bulletproof, as we could always hire 100 more SWEs to improve hack detection systems and manually read outputs. Mistakes do happen, in both directions. Plus there's always going to be a bit of unavoidable multiple model testing bias that's hard to precisely adjust for. Also, there are legitimate gray areas like what to do if your model asks genuinely useful clarifying questions that the original reference implementation scores as 0s, despite there being no instruction that clarifying questions are forbidden. Like, if you tell a model not to ask clarifying questions is that cheating or is that patching the eval to better align it with user value?
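For anyone unfamiliar with the multiple-testing bias mentioned above, a rough simulation shows the effect. This is a sketch under made-up assumptions: every checkpoint has the same true accuracy (`TRUE_ACC`), each eval run is independent noise, and all names and numbers are hypothetical.

```python
import random


def measured_score(true_acc: float, n_tasks: int, rng: random.Random) -> float:
    """One noisy eval run: each task passes independently with prob true_acc."""
    passed = sum(rng.random() < true_acc for _ in range(n_tasks))
    return passed / n_tasks


def reported_score(n_checkpoints: int, true_acc: float, n_tasks: int,
                   rng: random.Random) -> float:
    """Evaluate several equally good checkpoints and report only the best."""
    return max(measured_score(true_acc, n_tasks, rng)
               for _ in range(n_checkpoints))


def average(values) -> float:
    values = list(values)
    return sum(values) / len(values)


rng = random.Random(0)  # fixed seed so the run is repeatable
TRUE_ACC, N_TASKS, TRIALS = 0.60, 200, 50

one_checkpoint = average(
    reported_score(1, TRUE_ACC, N_TASKS, rng) for _ in range(TRIALS))
best_of_100 = average(
    reported_score(100, TRUE_ACC, N_TASKS, rng) for _ in range(TRIALS))

print(f"true accuracy:       {TRUE_ACC:.3f}")
print(f"single checkpoint:   {one_checkpoint:.3f}")
print(f"best of 100 reports: {best_of_100:.3f}")
```

Even though no checkpoint is actually better than any other, picking the best of 100 noisy measurements reports a score several points above the true accuracy. The size of the gap depends on the number of tasks and checkpoints, which is part of why it is hard to adjust for precisely.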

> pretty diligent about applying search blocklists, closing hacking loopholes, and reading model outputs to catch unanticipated hacks. If we wanted to, we could choose to close our eyes and plug our ears and report higher scores for Terminal-bench, SWE-bench, etc. that technically comply with the reference implementation but aren't aligned with real value delivered to users

Of course, but that's the difference between sins of commission and sins of omission. The question is what "pretty diligent" actually translates to in practice. How many people will encourage delays in a model release or post-training improvement waiting "for more thorough evaluation"? How many popularized AI results can you vouch for on this?

The zeitgeist is to celebrate bias for action, avoiding analysis paralysis and shipping things (esp. with conference driven research culture, even before we get into thorny questions of market dynamics), so even if we have a few pockets of meticulous excellence, the incentive structure pushes towards making the whole field rot.

I work at runloop and I've spent a considerable amount of time getting various benchmarks to run with very high concurrency (thousands at once). My experience is similar to your own: it takes a ton of time and effort setting up benchmarks to run at scale with protection against reward hacks.

Keeping a benchmark test harness secure and fast is non-trivial. You need to keep the grading script and the solution off the box, use network controls, deal with external resource usage, etc. It's a lot of work. I don't think it's realistic to expect benchmark authors to bulletproof their benchmark runners. Most benchmarks are written to be run conveniently on a single machine (i.e., in docker), not to run in parallel across tens of thousands of secure, isolated machines.

I remember the gpt-5 benchmarks and how wildly inaccurate they were data-wise. Linking one[0] that I found so that other people can remember what I am talking about. I remember some data being completely misleading or some reaching more than 100% (iirc)

And this reached the public eye in what was basically one of the most anticipated videos. So I find it a bit rough to think that OpenAI has the best practices for data if the public can be shown inaccurate graphs based on those very benchmarks. It makes it harder for me to trust the benchmarks themselves, or to believe that OpenAI wants legitimate benchmarks.

Also I find it wild that a month later nobody was talking about it. I remember thinking this was going to be a highlight for a long time: a mega billion dollar company made such basic graph errors. I feel like we are all forgetting a lot of things as our news cycle keeps moving faster.

(Another tangential point is about the OpenAI/Google employees who signed the pledge, yet nothing came of it. This is more recent, and I also remember one of your comments on Hacker News.)

> I'm an OpenAI employee and I'll go out on a limb with a public comment. I agree AI shouldn't be used for mass surveillance or autonomous weapons. I also think Anthropic has been treated terribly and has acted admirably. My understanding is that the OpenAI deal disallows domestic mass surveillance and autonomous weapons, and that OpenAI is asking for the same terms for other AI companies (so that we can continue competing on the basis of differing services and not differing scruples). Given this understanding, I don't see why I should quit. If it turns out that the deal is being misdescribed or that it won't be enforced, I can see why I should quit, but so far I haven't seen any evidence that's the case. [1]

This is a bit off-topic, so sorry about that, but you did say you would go out on a limb with a public comment, so please don't mind if I ask some questions. Everyone supported you then, and heck, even I thought that maybe I was wrong and should trust you more than my gut instincts, because you clearly must know so much more than me/us. But that aged like fine milk.

I would really love some answers, or your thoughts now, on that off-topic point if possible, as these questions are still unanswered and I would love to have a respectful discussion about it. Sorry for catching you off guard. Waiting for your reply, and I wish you a nice day, ted.

[0]: https://www.reddit.com/r/BetterOffline/comments/1mk6ofz/gpt5...

[1]: https://news.ycombinator.com/item?id=47191196

> I remember the gpt-5 benchmarks and how wildly inaccurate they were data-wise. Linking one[0] that I found so that other people can remember what I am talking about. I remember some data being completely misleading or some reaching more than 100% (iirc)

Yeah, I found that slide very embarrassing. It wasn't intentionally inaccurate or misleading - just a design error made right before we went live. All the numbers on that slide were correct, and there was no problem in terms of research accuracy or data handling or reward hacking. A single bar height had the wrong value, set to its neighbor. Back then, we in the research team would generate data and graphs, and then hand them off to a separate design team, who remade the graphs in our brand style. After the GPT-5 launch with multiple embarrassingly bad graphs, I wrote an internal library so that researchers could generate graphs in our brand style directly, without the handoff. Since then our graphs have been much better.
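A rough illustration of why removing the handoff helps (this is not the internal library, just a sketch with hypothetical names and made-up scores): if each bar's label and height are derived from the same number in one place, a bar cannot silently inherit its neighbor's height, and a cheap check can catch hand edits.

```python
MAX_HEIGHT_PX = 200  # hypothetical chart height for a 100% score


def expected_height(value: float) -> int:
    """Pixel height for a percentage score."""
    return round(value / 100 * MAX_HEIGHT_PX)


def bar_specs(scores: dict[str, float]) -> list[dict]:
    """Build label and height for each bar from the same underlying value."""
    for name, value in scores.items():
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"{name}: {value} is outside 0-100%")
    return [
        {"name": name,
         "label": f"{value:.1f}%",
         "height_px": expected_height(value)}
        for name, value in scores.items()
    ]


def mismatched(bars: list[dict]) -> list[str]:
    """Names of bars whose drawn height disagrees with their printed label."""
    return [
        b["name"] for b in bars
        if b["height_px"] != expected_height(float(b["label"].rstrip("%")))
    ]


# Made-up scores, for illustration only.
bars = bar_specs({"model_a": 52.8, "model_b": 69.1, "model_c": 30.8})
print(mismatched(bars))  # []

# Simulate a manual redraw that copies the neighbor's bar height.
bars[0]["height_px"] = bars[1]["height_px"]
print(mismatched(bars))  # ['model_a']
```

The range check would also reject a value above 100%, and the label-versus-height check is the kind of thing that can run automatically right before a slide ships.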

I don't think it's unfair to assume our sloppiness in graphs translates to sloppiness in eval results. But they are different groups of people working on different timelines, so I hope it's at least plausible that our numbers are pretty honest, even if our design process occasionally results in sloppy graphs.

Regarding the DoW deal, I don't want to comment too publicly. I also can't say anything with confidence, as I wasn't part of the deal in any way shape or form. My perception from what I have read and heard is that both Anthropic and OpenAI have good intentions, both have loosened their prior policies over time to allow usage by the US military, and both have red lines to prohibit abuse by the US military. One place they differ is in the mechanisms employed to enforce those red lines (e.g. usage policies vs refusals vs human oversight). Each company asserts their methods are stronger than the other's, so I think we have to make our own judgments there. Accounts from the parties involved in the negotiations also conflict, so I don't think anyone's account can be trusted 100%. With that caveat, I thought this article on the DoW's POV was interesting (seems to support the notion that the breakdown wasn't over differing red lines, especially since they almost managed to salvage the deal): https://www.piratewires.com/p/inside-pentagon-anthropic-deal...

Lastly, I hope it's obvious to everyone that Anthropic is not at all a supply chain risk and the threats there were incredibly disappointing. I support them 100% and I'm glad to see them unhurt by the empty threats.

This is what makes HN great: We get to hear from the people and not (only) the media dept. Thanks for your honesty and openness. I trust OpenAI a lot more when I hear balanced accounts like this.

Thank you for the transparency and insights! Very helpful.

We actually did the same thing re: generating charts in brand style to avoid any mishaps. Since then I sleep much better.

> The purpose of a system is what it does.

I am so tired of this saying.

It's not true, in general. Systems almost universally have unintended consequences and result in side effects their designers did not foresee.

Designing benchmarks resistant to adversarial attempts to exploit the benchmark software is just something no one was thinking about when they created SWE-bench.

https://en.wikipedia.org/wiki/The_purpose_of_a_system_is_wha...

You are misunderstanding the saying. It is entirely about unintended consequences and viewing the system for what it actually does and not any stated intentions of the designers.

I will propose that you are wrong.

1. We must ignore the intentions of the designers (your claim), and instead see what the outcomes are

2. Therefore we should ignore Beer's intentions when designing the phrase POSIWID, and instead see how it is used.

3. The overwhelming majority of people using it on the internet (including the GP comment) do so to imply that the people perpetuating the system actually desire the outcome.

So the purpose of POSIWID is clearly to imply intent.

Whose intent? POSIWID is about structural incentives, not personal intent, and these can be, and likely are, an emergent behavior. It’s about reframing away from intents, treating the system as a structure and removing the whole structure for replacement, as opposed to localized reforms, which are exposed to the same prior emergent behaviors, leading to constant backsliding.

> Whose intent?

The intent of those creating or perpetuating a system.

There are plenty of cases where you absolutely can/should discuss outcomes in a way where the intention is not factored in because it can often be straight up irrelevant.

If a gun is developed with the intention of hunting only bears and someone uses it to shoot people, you don’t have to constantly preface things by talking about how it’s supposed to be used only on bears. Sometimes that fact, depending on the context of the conversation, is simply not relevant.

To cover my bases here: yes it often is relevant and maybe even critical info, but it often isn’t either of those things.

I agree with the idea that intent is often irrelevant. I disagree that POSIWID is a good way to communicate that idea.

Well that’s stupid and completely ignores the meaning of the word “purpose”.

It does not ignore the word. It subverts it, and that's the point. It's the system equivalent of "death of the author", which states that once a work is written, the author's intent loses relevance and the work must be examined on its own. The author's opinion or relationship to the work carries no more weight than any other person's.

That's not "true" in any demonstrable sense, but it can be a useful form of analysis. As it is with "purpose of a system".

This is not how people outside of cybernetics use POSIWID. From context it does not appear to be how SlinkyOnStairs was using it either.

I think it's also trying to be too cute. The first two definitions of purpose on Wiktionary[A]:

1. The end for which something is done, is made or exists.

2. Function, role.

People (uselessly) talking about the purpose of a system are often referring to #1, while POSIWID is using it to mean #2. The real point of POSIWID is that only definition #2 matters. POSIWID is a terrible phrase not because it is wrong, but because it is an equivocation -- I suspect that Beer intended it as a pun, but the difference between the two is whether one gets the joke. POSIWID gets used incorrectly because people don't get the joke.

A: https://en.wiktionary.org/wiki/purpose

> From context it does not appear to be how SlinkyOnStairs was using it either.

The exact definition of "purpose" doesn't matter much here.

The particular version of the heuristic used here is that the stated purpose and the actual purpose often differ. POSIWID being the observation that the actual purpose is reflected by the outcomes of the system, because if that isn't the case the system gets changed.

Thus, the observation about AI benchmarks. AI companies have had years now to stop using unreliable benchmarks as advertising material. There have been years of piece after piece about the problems with these benchmarks. And yet the AI marketing continues as is.

I'd go further and say this is also the cybernetics equivalent of the religious teachings about humans, specifically the whole "judge by one's deeds, not by one's words" thing. So it's not like it's a novel idea.

Also worth remembering that most systems POSIWID is said about, and in fact ~all important systems affecting people, are not designed in the first place. Market forces, social, political, even organizational dynamics, are not designed top-down, they're emergent, and bottom-up wishes and intentions do not necessarily carry over to the system at large.

If you accept what the system actually does now, and decide to live with it as it is, you have just deprecated the original "purpose" and made it irrelevant. You embraced "the purpose is what it does" - to you.

IMHO the saying is meant to make you reflect.

I think the point is that if the side effects become known and are accepted, or if they are known and rejected, then indeed the purpose of the system is what it does.

> Designing benchmarks resistant to adversarial attempts to exploit the benchmark software is just something no one was thinking about when they created SWE-bench

That seems like a major oversight. "AI does whatever maximizes reward/minimizes loss, not what you actually want" is one of the biggest challenges in ML in the last two decades (relevant here because researchers selecting architectures and training regimens that maximize public benchmarks are just a bigger training loop with those benchmarks as the reward function). And the analogous issue post-training in AGI-like systems is well studied as the alignment problem, the core issue of classical AI safety.

If cheating the benchmark is easier than passing it, you expect the cheating strategy to emerge and win. (Just like you would with humans btw)

I think the point of the saying is that as systems tend to expand, sooner or later we become part of them. That means that we can no longer see them from outside, we're now part of the system and our goals and the system's goals will align. Then the purpose of the system can't be anything else than what it does.

Same. Anyone who has designed anything at all in any domain realizes that what your intentions are and what materializes are often not the same. You have practical constraints in the real world. That doesn’t somehow make the constraints the purpose. The saying makes no sense.

In true HN fashion, you’re an engineer who somehow thinks you should just form opinions through divine intuition instead of actually reading the source material, which you very clearly haven’t done.

You’d think that for you to become “so sick of” a saying, you might actually at some point read up on what it means.

> AI companies want adcopy, not legitimate benchmarks.

Labs need accurate benchmark measurements, at least internally, to figure out what model improvements actually matter.

Having models exploit benchmarks serves no purpose. If they wanted to make their models look better than they are, they could just make the data up.

That is Anthropic’s shtick to a tee.