by ukFxqnLa2sBSBf6
479 days ago
There’s a few things I’m not understanding here. 1. Did the benchmark authors not review the issues and make sure the solution was not present in the issue? 2. Are the issues locked after they’re included in the dataset? You’d think they would be immutable for reproducibility. 3. For the agents writing patches, is test running part of their inner loop validation? If they write a patch that makes the test pass, then the job’s done. Or is that validation step kept secret from the agent? I don’t see how unless the tests aren’t part of the repo.
I looked at a bunch of issues in the dataset when SWE-bench Verified first came out and I was trying to build scaffolding to solve it, and I don't remember a single time where the solution existed verbatim in the issue. I'm not saying it never happens, but it would have to be rare.
> 2. Are the issues locked after they’re included in the dataset?
No one changes the issues in the dataset, but of course the original issue on GitHub will have been resolved long ago. The models don't have access to that resolution in their context, but if they were trained on GitHub data there's a very real risk that they've seen the solution.
> 3. For the agents writing patches, is test running part of their inner loop validation? If they write a patch that makes the test pass, then the job’s done. Or is that validation step kept secret from the agent? I don’t see how unless the tests aren’t part of the repo.
The tests aren't provided to the model; they are run after the model has proposed its final answer.
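To make that flow concrete, here's a toy sketch of held-out grading. This is not the actual SWE-bench harness: `Task`, `run_agent`, `grade`, `hidden_check_add`, and `fixing_agent` are all hypothetical names I made up, and the "repo" is just a dict of file contents. The point is only that the agent sees the issue and the repo, while the checks live with the grader and run once on the final patch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Repo = Dict[str, str]  # file path -> file contents (toy stand-in for a checkout)


@dataclass
class Task:
    issue_text: str                             # visible to the agent
    repo: Repo                                  # visible to the agent
    hidden_tests: List[Callable[[Repo], bool]]  # held by the grader only


def run_agent(agent: Callable[[str, Repo], Repo], task: Task) -> Repo:
    # The agent receives the issue text and a copy of the repo,
    # never the hidden tests.
    return agent(task.issue_text, dict(task.repo))


def grade(task: Task, patch: Repo) -> bool:
    # Apply the final patch, then run the held-out checks once.
    patched = {**task.repo, **patch}
    return all(check(patched) for check in task.hidden_tests)


def hidden_check_add(repo: Repo) -> bool:
    # Hypothetical held-out test: load calc.py and check add().
    namespace: dict = {}
    exec(repo["calc.py"], namespace)
    return namespace["add"](2, 3) == 5


task = Task(
    issue_text="add(2, 3) returns -1 instead of 5",
    repo={"calc.py": "def add(a, b):\n    return a - b\n"},
    hidden_tests=[hidden_check_add],
)


def fixing_agent(issue_text: str, repo: Repo) -> Repo:
    # Hypothetical agent: returns new contents for the files it changed.
    return {"calc.py": "def add(a, b):\n    return a + b\n"}


print(grade(task, run_agent(fixing_agent, task)))  # True
print(grade(task, {}))                             # False (unpatched repo)
```

Since `run_agent` never passes `hidden_tests` to the agent, the agent can't validate against them in its inner loop; it can only run whatever it finds or writes in the repo itself.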