by comex 479 days ago
Some of the examples in the paper seem to be wrong.

For django-31056, they claim the AI-generated patch is "incomplete" because it's "missing critical parts of this logic, such as the try-except block and the check for a running event loop." But if you look at the diff, that's clearly wrong. The try-except block and the running check were already there before the patch. The human patch just indented them, making them appear as both `-` and `+` lines, while the AI patch didn't. To me, the AI patch seems correct. It's slightly less efficient than the human patch when DJANGO_ALLOW_ASYNC_UNSAFE is set, but slightly more efficient when it isn't (which is the common case!). The human patch does feel more natural, but the AI patch is fine. I'd grade it a tie between human and AI.

For django-32517, they claim that the human and AI patches "produce entirely different outputs", but actually they do exactly the same thing. The human version has `reversed(self.dict)`, while the AI version has `reversed(self.dict.keys())`. `reversed` calls the object's `__reversed__` method, and reverse-iterating over a dictionary in Python (3.8+) just gives you the keys in reverse insertion order, so it doesn't matter whether you call `.keys()` first. The human patch is more idiomatic, but it's also more confusing, as shown by the fact that it confused the authors of this paper. I'd grade it another tie.
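If anyone wants to check this for themselves, here's a minimal sketch. The two classes are toy stand-ins I made up for illustration, not Django's actual `OrderedSet`; the only parts taken from the patches are the two `reversed(...)` expressions. It assumes Python 3.8 or later, where dicts and their key views are reversible.

```python
class OrderedSetHuman:
    """Toy stand-in (hypothetical) using the human patch's expression."""

    def __init__(self, iterable=None):
        # A dict preserves insertion order, so its keys act as an ordered set.
        self.dict = dict.fromkeys(iterable or ())

    def __reversed__(self):
        return reversed(self.dict)


class OrderedSetAI(OrderedSetHuman):
    """Toy stand-in (hypothetical) using the AI patch's expression."""

    def __reversed__(self):
        return reversed(self.dict.keys())


items = ["a", "b", "c"]
human = list(reversed(OrderedSetHuman(items)))
ai = list(reversed(OrderedSetAI(items)))

print(human)        # ['c', 'b', 'a']
print(ai == human)  # True
```

Both variants yield the keys in reverse insertion order, so the outputs are identical.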

Edit: I tried to sign up for OpenReview so I could leave a comment about this, but the system wouldn't let me register without completing a form that assumes you have an academic position. Perhaps I should email the authors.

6 comments

The entire premise of this paper is false. In Section 2.1.1, they claim that the "hints_text" is used and leaks the answer; however, the authors of SWE-Bench themselves state that this is not used anywhere (Issue #133 on the official SWE-Bench GitHub).

According to the paper:

> 1. Solution leak: represents instances where the solution to the issue is clearly outlined in the issue description or comments on GitHub. Since both the issue descriptions and comments (referred to as hints_text in the SWE-Bench study) are provided as input to the models, these LLM models can extract the solutions directly from this information instead of generating it independently.

And yet, the SWE-Bench authors themselves explicitly state:

> In short, for participating on the SWE-bench leaderboard, using hints_text in any manner is not allowed. Although we don't explicitly say this in the original paper, we also do not make any mention of using the hints_text anywhere.

So, it's a made-up issue that would only occur if you deviated from the paper implementation and explicitly added a field called "hints" that isn't used anywhere.

Hmm. For the example they give of solution leakage, sympy issue 16669 aka sympy__sympy-16766[1], the solution actually appears in problem_statement, so it seems to be genuine leakage. But you're right that they claim that hints_text is used, so they may have improperly winnowed out other instances where the solution only appears in hints_text.

[1] Don't ask me why they cited the issue number, 16669, instead of the pull request number, 16766, when only the latter appears in the dataset. This confused me for a bit.
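To make the distinction concrete, here's a rough sketch of how one could tell the two cases apart. `leak_location` is a hypothetical helper of mine, and the instance below is toy data rather than a real dataset row; only the field names `problem_statement` and `hints_text` come from the dataset.

```python
def leak_location(instance, snippet):
    """Return the text fields of a task instance that contain `snippet`."""
    fields = ("problem_statement", "hints_text")
    return [name for name in fields if snippet in (instance.get(name) or "")]


# Toy instance (hypothetical), shaped like a dataset row.
instance = {
    "instance_id": "example__example-1",
    "problem_statement": "Adding a __reversed__() method should fix this.",
    "hints_text": "You could return reversed(self.dict).",
}

print(leak_location(instance, "__reversed__()"))       # ['problem_statement']
print(leak_location(instance, "reversed(self.dict)"))  # ['hints_text']
```

Only a hit in `problem_statement` would count as leakage under the leaderboard rules quoted above, since `hints_text` is not supposed to be part of the model input.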

> For django-32517

Although I agree with your analysis and it doesn't look great for the authors, this issue (https://code.djangoproject.com/ticket/32517) arguably falls into their "Solution leak" category anyways, as the following text appears in the issue description (and so I think directly in `problem_statement` rather than `hints_text`):

> Currently, OrderedSet isn't reversible (i.e. allowed to be passed as an argument to Python's reversed()). This would be natural to support given that OrderedSet is ordered. This should be straightforward to add by adding a __reversed__() method to OrderedSet.

It isn't the exact code though, so I suppose it could be argued instead that the issue is just extremely easy.

Interesting analysis! I hadn't dug into the specific patch details like that. It's a good reminder that "correctness" isn't always the only dimension to evaluate these AI-generated patches – readability and idiomatic style definitely matter too, even if the functional outcome is the same.

I've been playing around with some automated code review tools recently, and it's surprising how often they flag things that are technically correct but just... unusual. Style matters, especially for maintainability.

I can only confirm two mistakes in the paper: 1) as you say, the reversed(self.dict) is actually correct; 2) as another poster below said, hints are not part of the input. These two mistakes are so egregious given the objective of the paper that I'm convinced the authors are not qualified to write it.

IMHO, it is probably better to discard this paper, and wait for someone else to cover this important topic.

I think you should. Looks like there is more work to do.
The paper should then be retracted.