

by ukFxqnLa2sBSBf6 479 days ago

There’s a few things I’m not understanding here.

1. Did the benchmark authors not review the issues and make sure the solution was not present in the issue?

2. Are the issues locked after they’re included in the dataset? You’d think they would be immutable for reproducibility.

3. For the agents writing patches, is test running part of their inner loop validation? If they write a patch that makes the test pass, then the job’s done. Or is that validation step kept secret from the agent? I don’t see how unless the tests aren’t part of the repo.

2 comments

sebzim4500 479 days ago

>1. Did the benchmark authors not review the issues and make sure the solution was not present in the issue?

I looked at a bunch of issues in the dataset when SWE-verified first came out, while I was trying to build scaffolding to solve it, and I don't remember a single time where the solution existed verbatim in the issue. I'm not saying it never happens, but it would have to be rare.
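For anyone who wants to spot-check this rather than eyeball it, here is a minimal sketch of a verbatim-leak check. It assumes you have the issue text and the reference patch as a unified diff; the function name `leaked_lines` and the `min_len` threshold are made up for illustration and are not part of any benchmark tooling.

```python
from typing import List


def leaked_lines(issue_text: str, patch: str, min_len: int = 20) -> List[str]:
    """Return added patch lines that also appear verbatim in the issue text.

    Hypothetical helper: `min_len` skips short lines such as `else:` that
    would match by coincidence.
    """
    hits = []
    for line in patch.splitlines():
        # Added lines start with "+"; "+++" is the file header, not code.
        if line.startswith("+") and not line.startswith("+++"):
            code = line[1:].strip()
            if len(code) >= min_len and code in issue_text:
                hits.append(code)
    return hits
```

This only catches literal copies. A solution described in words, or pasted with different whitespace, would slip through.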

> 2. Are the issues locked after they’re included in the dataset?

No one changes the issues in the dataset, but of course the original issue on GitHub will have been resolved long ago. The models don't have access to this in their context, but if they were trained on GitHub there's a very real risk that they've seen the solution.

> 3. For the agents writing patches, is test running part of their inner loop validation? If they write a patch that makes the test pass, then the job’s done. Or is that validation step kept secret from the agent? I don’t see how unless the tests aren’t part of the repo.

The tests aren't provided to the model; they are run after the model has proposed its final answer.
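Roughly, the grading step looks like the sketch below. This is not the actual SWE-bench harness: the function name `is_resolved` and the shape of `test_results` are assumptions for illustration. FAIL_TO_PASS and PASS_TO_PASS are the two held-out test lists attached to each task, and a patch counts as resolving the issue only if every test in both lists passes once the patch is applied.

```python
from typing import Dict, List


def is_resolved(
    test_results: Dict[str, bool],
    fail_to_pass: List[str],
    pass_to_pass: List[str],
) -> bool:
    """Decide whether a proposed patch resolves a task.

    `test_results` maps test name -> passed?, collected by running the
    held-out tests after the model's final patch is applied (hypothetical
    shape). A test missing from the results is treated as a failure.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(test_results.get(name, False) for name in required)
```

The point is that the model never sees `fail_to_pass` or `pass_to_pass`; it can run whatever tests already exist in the repo, but the ones used for grading are applied afterwards.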


jbellis 479 days ago

Especially with swe-verified, I thought that was the whole point of that dataset


flakiness 479 days ago

This was also my first thought, but reading [1] again, what they did was label things like:

> Whether we consider the issue description to be underspecified and hence unfair to be testing on.

> Whether the FAIL_TO_PASS unit tests filter out valid solutions

and a bit more. This is pointed out in the linked paper too.

The moral of the story, to me, is: don't believe the paid human annotators. You can (hopefully) still believe the PhD students doing these unpaid jobs as their research ;-)

[1] https://openai.com/index/introducing-swe-bench-verified/
