Some of the examples in the paper seem to be wrong.

For django-31056, they claim the AI-generated patch is "incomplete" because it's "missing critical parts of this logic, such as the try-except block and the check for a running event loop." But if you look at the diff, that's clearly wrong. The try-except block and the running-loop check were already there before the patch. The human patch just indented them, making them appear as both - and + lines, while the AI patch left them alone. To me, the AI patch seems correct. It's slightly less efficient than the human patch when DJANGO_ALLOW_ASYNC_UNSAFE is set, but slightly more efficient when it isn't (which is the common case!). The human patch does feel more natural, but the AI patch is fine. I'd grade it a tie between human and AI.

For django-32517, they claim that the human and AI patches "produce entirely different outputs", but actually they do exactly the same thing. The human version has `reversed(self.dict)`, while the AI version has `reversed(self.dict.keys())`. Reversing a dictionary in Python, like iterating over one, just gives you the keys (in reverse insertion order, on Python 3.8+), so it doesn't matter whether you call `.keys()` first. The human patch is more idiomatic, but it's also more confusing, as shown by the fact that it confused the authors of this paper. I'd grade it another tie.

Edit: I tried to sign up for OpenReview so I could leave a comment about this, but the system wouldn't let me register without completing a form that assumes you have an academic position. Perhaps I should email the authors.
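The django-32517 point is easy to check directly. Here's a minimal sketch, assuming Python 3.8+ (where dicts and their key views are reversible). `OrderedSetSketch` and its method names are made up for illustration and are not the actual Django class or either patch; only the `self.dict` attribute and the two `reversed(...)` expressions come from the discussion above.

```python
# Hypothetical stand-in for a set-like class backed by a dict.
# Not the real Django code; it only mirrors the two expressions being compared.
class OrderedSetSketch:
    def __init__(self, iterable=()):
        # dict.fromkeys keeps insertion order and drops duplicates
        self.dict = dict.fromkeys(iterable)

    def reversed_plain(self):
        # Shape of the human patch: reverse the dict itself
        return reversed(self.dict)

    def reversed_via_keys(self):
        # Shape of the AI patch: reverse the keys view
        return reversed(self.dict.keys())


s = OrderedSetSketch(["a", "b", "c", "a"])
plain = list(s.reversed_plain())
via_keys = list(s.reversed_via_keys())

# Both yield the keys in reverse insertion order
assert plain == via_keys == ["c", "b", "a"]
```

Both calls return iterators over the same keys in the same order, so a caller can't tell the two versions apart by their output.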
According to the paper:
> 1. Solution leak: represents instances where the solution to the issue is clearly outlined in the issue description or comments on GitHub. Since both the issue descriptions and comments (referred to as hints_text in the SWE-Bench study) are provided as input to the models, these LLM models can extract the solutions directly from this information instead of generating it independently.
And yet, the SWE-Bench authors themselves explicitly state:
> In short, for participating on the SWE-bench leaderboard, using hints_text in any manner is not allowed. Although we don't explicitly say this in the original paper, we also do not make any mention of using the hints_text anywhere.
So it's a made-up issue that would only occur if you deviated from the paper's implementation and explicitly fed the model `hints_text`, a field that isn't used anywhere.
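To make the distinction concrete, here's a rough sketch of what "not using hints_text" looks like when building the model input. The record below is invented, `build_model_input` is a hypothetical helper, and I'm assuming the issue text lives in a `problem_statement` field next to `hints_text`. This is not the SWE-bench harness code.

```python
# Invented SWE-bench-style record, for illustration only.
instance = {
    "instance_id": "example__repo-1",
    "problem_statement": "Calling foo() with an int raises TypeError.",
    "hints_text": "Maintainer comment: the fix is to cast the argument to str.",
}


def build_model_input(record):
    """Return the text shown to the model, using only the issue description."""
    # hints_text is deliberately never read here
    return record["problem_statement"]


prompt = build_model_input(instance)

# The maintainer's hint never reaches the model
assert "cast the argument" not in prompt
```

A leak through the comments can only happen if the prompt builder goes out of its way to read `hints_text`.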