Hacker News
AutoCodeRover: Autonomous Program Improvement (github.com)
96 points by mechtaev 807 days ago
11 comments

>As an example, AutoCodeRover successfully fixed issue #32347 of Django.

This bug was fixed three years ago in a one-line change.[0] Presumably the fix was already in the training data.

[0] https://github.com/django/django/pull/13933

I wondered that too, but the fix it produces is not the same.

Another thing that seemed odd is the English style used in the responses (watch the video full screen and you can read it).

My understanding is all of the issues on SWE-bench have at least a corresponding pull request.
The important detail is when the problem was solved. If it was three years ago, then it was likely captured in the training data for the model.
And the other 78% of time it just creates a bunch of noise that someone has to sift through?
Here's a list of all the successful and unsuccessful patches: https://gist.github.com/arp242/0dc5dab0f7cd10e663cfc26866651...

Ideally, it should also include the problem statement, but that's not in their JSON file and I can't be arsed to continue working on it; it's just a quick script I cooked up.

I find it very hard to judge the quality of most of these patches because I'm not familiar with these projects.

However, looking at the SWE-bench dataset I don't think it's representative of real-world issues, so "22% of real-world GitHub issues" is not really accurate regardless.

The problem statement of each issue is included in each result folder as `problem_statement.txt` (such as: https://github.com/nus-apr/auto-code-rover/blob/main/results...).

The developer patch for each issue is similarly included as `developer_patch.diff`.

What makes you say it's not representative?
SWE-bench Lite is a subset of extremely simple issues from a cherry-picked subset (SWE-bench) of a handful of large (presumably well-run) Python-only projects.

Here are some rules they used to trim down the SWE-bench Lite problems:

* We remove instances with images, external hyperlinks, references to specific commit shas and references to other pull requests or issues.

* We remove instances that have fewer than 40 words in the problem statement.

* We remove instances that edit more than 1 file.

* We remove instances where the gold patch has more than 3 edit hunks (see patch).

See https://www.swebench.com/lite.html
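
As a rough illustration of how restrictive those rules are, the quoted filters can be sketched as a predicate over a benchmark instance. The field names (`problem_statement`, `gold_patch`) and the regexes are assumptions made up for this sketch, not the actual SWE-bench schema or filtering code, and the patterns are crude (the SHA pattern, for instance, will also match long plain numbers).

```python
import re

# Hypothetical patterns; the real SWE-bench Lite filters may differ.
URL_RE = re.compile(r"https?://")
SHA_RE = re.compile(r"\b[0-9a-f]{7,40}\b")
ISSUE_REF_RE = re.compile(r"(?<!\w)#\d+")
IMAGE_RE = re.compile(r"!\[[^\]]*\]\(|<img\b", re.IGNORECASE)


def count_files(patch: str) -> int:
    """Count files touched by a git diff (one 'diff --git' header per file)."""
    return sum(1 for line in patch.splitlines() if line.startswith("diff --git "))


def count_hunks(patch: str) -> int:
    """Count edit hunks in a unified diff (one '@@' header per hunk)."""
    return sum(1 for line in patch.splitlines() if line.startswith("@@"))


def keep_for_lite(instance: dict) -> bool:
    """Return True if the instance survives the four rules quoted above."""
    text = instance["problem_statement"]  # assumed field name
    patch = instance["gold_patch"]        # assumed field name
    # Rule 1: no images, external links, commit SHAs, or issue/PR references.
    if IMAGE_RE.search(text) or URL_RE.search(text):
        return False
    if SHA_RE.search(text) or ISSUE_REF_RE.search(text):
        return False
    # Rule 2: at least 40 words in the problem statement.
    if len(text.split()) < 40:
        return False
    # Rule 3: the gold patch edits exactly one file.
    if count_files(patch) > 1:
        return False
    # Rule 4: at most 3 edit hunks.
    if count_hunks(patch) > 3:
        return False
    return True
```

Any issue that links to a related ticket, pastes a screenshot, or needs a two-file fix is dropped by a filter like this, which is the point being made about representativeness.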

That's... rather limiting.
Look at the data. Does that seem like the average bug report to you?
It would help if you were to provide a specific example or two
You can't demonstrate whether a dataset is representative or not by "an example or two". You need to look at all the data.

And all of this is fine. It's just a benchmark suite and doesn't need to be fully representative. The dataset itself doesn't even claim to be that as far as I can find. All I'm saying is that the title wasn't really accurate.

In short, no.

The ArXiv paper mentions the human developer must supply a unit test (which can conceivably be coded with at least the assistance of an AI agent if not autonomously coded, but their experiment relies upon the former kind of unit test) that issues a pass-fail signal. So the 78% of failures are clearly identified, at the cost of implementing TDD for the Issue. The side effects story is punted upon, but I’d still take this over the nothing we have today.

Of course, over a relatively short amount of time using this, I’d expect the 22% (or whatever the real rate is) success rate to drop asymptotically towards zero as the low-hanging fruit of the approach is mined out and it becomes kind of like another linter in our CI/CD pipelines.

The impact of this tooling upon staff skills development will be interesting to say the least.

AutoCodeRover does not require or assume a unit test to generate patches. The results discussed in Section 6.1 of the ArXiv paper are generated without any unit test. The unit tests are used by SWE-bench, when evaluating the correctness of AutoCodeRover-generated patches.

That being said, when some unit tests are available (either written by developers or with assistance from other tools), AutoCodeRover can make use of them to perform some analysis like Spectrum-based Fault Localization (SBFL). This kind of analysis output can help the agent in pinpointing locations to fix. (Please see Section 6.2 for the analysis on SBFL.)
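
For readers unfamiliar with SBFL, here is a minimal sketch of the idea using the Ochiai formula, one common suspiciousness metric. Which formula AutoCodeRover uses is not stated in this thread, and the coverage data below is invented for illustration.

```python
import math


def ochiai_scores(coverage: dict[str, set[str]], failed: set[str]) -> dict[str, float]:
    """Score code locations by suspiciousness.

    coverage maps a test name to the set of locations it executes;
    failed is the set of failing test names.
    """
    total_failed = len(failed)
    locations = set().union(*coverage.values()) if coverage else set()
    scores = {}
    for loc in locations:
        # ef / ep: number of failing / passing tests that execute this location.
        ef = sum(1 for t, locs in coverage.items() if loc in locs and t in failed)
        ep = sum(1 for t, locs in coverage.items() if loc in locs and t not in failed)
        denom = math.sqrt(total_failed * (ef + ep))
        scores[loc] = ef / denom if denom else 0.0
    return scores


# Invented example: two passing tests and one failing test.
coverage = {
    "test_a": {"f.py:10", "f.py:11"},
    "test_b": {"f.py:10"},
    "test_c": {"f.py:10", "f.py:12"},
}
scores = ochiai_scores(coverage, failed={"test_c"})
ranked = sorted(scores, key=scores.get, reverse=True)
```

Locations executed mostly by failing tests rank highest, which gives an agent a short list of places to look first.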

> AutoCodeRover does not require or assume a unit test to generate patches.

You have this backwards: it's traditional (at least in the past 15 years or so) to have a test to go along with every code change. The idea is that the test proves a) the bug existed prior to the fix and b) the bug is not there after the fix is applied. Commenters here are noting that ACR generates fixes but not tests.
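
The two-part check described here (the test fails before the fix and passes after it) can be sketched as below. The buggy and fixed functions are toy stand-ins invented for this example; nothing here reflects how ACR or SWE-bench actually run tests.

```python
from typing import Callable


def buggy_mean(xs: list[float]) -> float:
    # Toy bug: floor division truncates the result.
    return sum(xs) // len(xs)


def fixed_mean(xs: list[float]) -> float:
    return sum(xs) / len(xs)


def regression_test(mean: Callable[[list[float]], float]) -> bool:
    """Return True when the implementation under test behaves correctly."""
    return mean([1, 2]) == 1.5


def validates_fix(test, before, after) -> bool:
    """A test supports a fix only if it fails before the patch and passes after it."""
    return (not test(before)) and test(after)
```

A test that passes on both versions tells you nothing about the patch, which is why the "fails before" half matters as much as the "passes after" half.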

The previous comment was to describe the experiment settings. AutoCodeRover currently generates patches. Auto-generating high quality tests can be a parallel effort and another direction to explore. These efforts can eventually be used together.
The point is that a patch without a test is not generally a useful thing. How do we know the AI generated patches are valid?
The short answer is that unit tests are not needed in AutoCodeRover. The technique proceeds by a sophisticated code search starting from the GitHub issue. The code search helps in setting the context for the LLM agents, which in turn helps in the patch construction.

If tests are available, they can give additional help in setting code context. But tests are not needed, and most of the GitHub issues are solved without tests.
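
To make "code search starting from the issue" concrete, here is a deliberately simplified sketch: pull identifier-like words out of the issue text and look up matching class and function definitions with Python's `ast` module. This is an illustration only, not AutoCodeRover's implementation; the function names and the sample source are invented.

```python
import ast
import re


def identifiers_in_issue(issue_text: str) -> set[str]:
    """Collect identifier-like tokens from the issue text."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", issue_text))


def search_definitions(source: str, names: set[str]) -> dict[str, str]:
    """Map each matching class/function name to its source snippet."""
    tree = ast.parse(source)
    found = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            if node.name in names:
                found[node.name] = ast.get_source_segment(source, node)
    return found


# Invented sample project file and issue text.
source = '''
class Cache:
    def get_key(self, name):
        return name.lower()

def unrelated():
    return 1
'''
issue = "Cache.get_key should not lowercase the name"
context = search_definitions(source, identifiers_in_issue(issue))
```

Only the definitions named in the issue end up in `context`, so the prompt can carry a few relevant snippets instead of the whole repository.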

All experimental numbers appear in the arxiv paper. Please let us know if you have more questions.

> tests are not needed

Strong words!

Yes, and to be clear, the benchmark used here is merely the 300 simplest problems in the larger benchmark suite, which itself is only a tiny subset of issues from a dozen large (and presumably well-curated) Python projects.

Not to mention that making the code fix is only a tiny part of resolving an issue. There should also be explanations and added test cases. In other words, I doubt the 22% of “fixes” would pass review by the project owner if a human submitted them.

In my experience, that's better than the percentage of usually well-intentioned but nevertheless unusable PRs that popular repositories get.
The point is that the success rate is progressing, paper after paper

> The baseline results of Magis (10%), Devin (14%) are evaluated in another subset of SWE-bench, which we cannot directly compare with, so we take the results from their technical reports as a reference.

Wondering how it compares with these models.

Why not use AutoCodeRover, Magis, and Devin together for 46%

/s

Just about a week ago open devin got about 13% on this benchmark. Just give it a few more weeks.

edit: apparently it's not the exact same benchmark but a similar one

If it continues at this pace, then it'll solve 108% of GitHub issues in just 3 months
if a ticket is open and AutoCodeRover just says "was unable to find something" it's still better to have 22% fixed automatically.
But it doesn’t say that. It submits a patch that doesn’t solve the problem instead.
LLMs are unable to say that they don't know something. They just generate nonsense.
There are actual SWE jobs where humans sift through this kind of noise. Someone told me they worked such a job recently. It's a good tool to add pressure and raise expectations. Maybe this is the future..
They only know the 22% number because unit tests to check for a fix are included in the benchmark. In other words, in a real world situation, the human would still need to double check. The patches this tool generates do not include appropriate tests or explanations and would never pass code review by a qualified human.
I would be interested to see how it performs on end-user software where bug reports are nebulous at best, ridiculous at worst. Furthermore, most of those fixes tend to be upstream bugs and very rarely anything to do with the actual software.
The entire setup is available for inspection, please see

https://github.com/nus-apr/auto-code-rover

if you need example bugs we can provide that too. Some examples also appear in the arxiv paper, please see

https://arxiv.org/abs/2404.05427

This is super fascinating stuff, excellent work. As most of us don't have the time to read the entirety of the paper, are you able to directly link to some issues which have been landed and closed? Some personal favorites would be awesome :)

I think I speak for others when I say the best way to judge the efficacy of this project is some real-world, on-site examples of it being used in prod. I'm especially curious about its performance on feature-request or flaky-bug-report type issues as opposed to reliable test failures. I expect the former is much tougher!

Thank you for your interest. There are some interesting examples in the SWE-bench-lite benchmark which are resolved by AutoCodeRover:

- From sympy: https://github.com/sympy/sympy/issues/13643. AutoCodeRover's patch for it: https://github.com/nus-apr/auto-code-rover/blob/main/results...

- Another one from scikit-learn: https://github.com/scikit-learn/scikit-learn/issues/13070. AutoCodeRover's patch (https://github.com/nus-apr/auto-code-rover/blob/main/results...) modified a few lines below (compared to the developer patch) and wrote a different comment.

There are more examples in the results directory (https://github.com/nus-apr/auto-code-rover/tree/main/results).

fwiw the example issue highlighted in the post was already fixed by a human 3 years ago so I wouldn't expect to see much in the way of real life fixed issues yet.
Author published a ready-to-use Docker image: https://hub.docker.com/r/yuntongzhang/auto-code-rover/tags
So this works for repositories with decent unit tests.

Which excludes 80+% of real world bug and feature issues, in my experience…

No, it does not need a unit test to work. We responded to a similar question from another user.
Then how can you have confidence that it actually fixes the bug? It means you still need a human to review the fix, no?
The developer-written test cases are provided in SWE-bench-lite, so those could be used to check the generated patches.

The auto-generated patches are to reduce the effort of resolving issues. In practice, they should be reviewed and verified by human developers before they are integrated.

Thank you for the clarification. And shame on me for talking out of my ass!
How well does AutoCodeRover work in relation to compiled languages such as Java, Go, or Rust?

The local code search idea to get around context limits is cool. Have you experimented with Anthropic's models for the larger context limit and dropping the code search?

What's actually going on here? I watched the video of the example problem solution and it looks like either magic or fake. It doesn't produce the same PR as the real bug fix.
The entire setup is available for inspection from

https://github.com/nus-apr/auto-code-rover

Please try it out and send emails to the contact emails in this webpage, if you have any questions.

Ok thanks. I haven't run it yet, but this does tell me that it's using OpenAI.

Is it expected to be able to solve arbitrary (simple) bugs, or only the list of bugs in the benchmark set?

Excited to see how badly the comments here age over the next few months.
did someone here replicate this on their own code?
See post above. It is expected to be runnable by anyone from the git repo contents.
at the time of writing this their repo is 12h old. the training time isn't stated in the paper. i'm thinking maybe one of these robots can replicate this and tell us how it went.
22% is a hilariously low percentage to use as a tagline. I do hope it gets better.
22% less issues for free is bad?
No, but only 22% of tackled issues being resolved correctly hints at how bad it is. I'd guess that most of those 22% have bugs or miss edge cases, considering it completely failed to solve 80%.

If someone gets 20% on an exam I don't go "great, that's 20% of the way there!!!", instead I go "you clearly didn't attend, try again next time".

> If someone gets 20% on an exam I don't go "great, that's 20% of the way there!!!", instead I go "you clearly didn't attend, try again next time".

Sure, if little Bobby gets a 20% I’ll whoop his ass, but if the inanimate hunk of metal on my desk gets a 20% I might start to take notice.

Sure, it's sorta cool and promising for technology in the future. But as of now, we don't know how badly it might fuck up the other 78% of cases. If it fucks up like 50% of cases so badly that it takes more time for the devs to fix than it would usually, then it's a liability.
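
The liability argument is a break-even calculation, sketched below. Only the 22% figure comes from the thread; the time costs and the share of harmful patches are invented assumptions, so treat the output as an illustration of the shape of the trade-off, not a measurement.

```python
def expected_minutes_saved(p_good: float, p_harmful: float,
                           saved_if_good: float, lost_if_harmful: float,
                           review_cost: float) -> float:
    """Expected minutes saved per issue; negative means the tool is a net cost.

    Every patch costs review_cost to inspect. A good patch saves saved_if_good,
    a harmful one wastes lost_if_harmful, and the rest are simply discarded.
    """
    return p_good * saved_if_good - p_harmful * lost_if_harmful - review_cost


# 22% good is from the thread; every other number is an invented assumption.
mild = expected_minutes_saved(0.22, 0.10, saved_if_good=60, lost_if_harmful=30, review_cost=5)
bad = expected_minutes_saved(0.22, 0.50, saved_if_good=60, lost_if_harmful=30, review_cost=5)
```

With these made-up inputs the first case comes out positive and the second negative, so the same 22% success rate can be a win or a loss depending on how costly the failures are.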
It’s not free.
Fewer.