by dimal 814 days ago
The demo shows a very clearly written bug report about a matrix operation that’s producing an unexpected output. Umm… no. Most bug reports you get in the wild are more along the lines of “I clicked on X and Y happened”; then, if you’re lucky, they’ll say “and I expected Z”. Usually the Z expectation is left for the reader to fill in because as human users we understand the expectations.

The difficulty in fixing a bug is in figuring out what’s causing the bug. If you know it’s caused by an incorrect operation, and we know that LLMs can fix simple defects like this, what does this prove?

Has anyone dug through the paper yet to see what the rest of the issues look like? And what the diffs look like? I suppose I’ll try when I have a sec.

9 comments

> Most bug reports you get in the wild are more along the lines of

Since this fixes 12% of the bugs, the authors of the paper probably agree with you that 100 - 12 = 88%, and hence "most bugs" don't have nicely written bug reports.

In my 15 years I would say less than 1% of bug reports are like this. If you know the bug to this level, most people would just fix it themselves.
12% is a very very large number for that kind of problem. I doubt even 0.1% of bug reports in the wild are that well written.
Except this is automated, so you could get multiple orders of magnitude more bugs filed, so you need to have a very low false positive ratio to avoid being overwhelmed by automatically generated crap (which is basically spam).
Have the LLM rewrite the bug reports.
You'd want three LLMs: one to create the bugs, one to report them, one to fix them. I joke of course, but on the other hand this is potentially a worthwhile architecture from a self-training perspective: a bug-creating LLM means your training set size is as big as you want it, +/- GAN features.
Why not have the LLM write AGI while you're at it?
It is and it will!
It fixes 12% of their benchmark suite, not 12% of bug reports.
I suppose I should nail down my point. No one would ever write a bug report like this. A bug generally has an unknown cause. Once you found the cause of the bug, you’d fix it. Nowadays, you could just cut and paste the problem into ChatGPT and get the answer right then. So why would anyone ever log this bug? All this demo proves is that they automated a process that didn’t need automation.
To be fair, sometimes meticulous users investigate the bugs and write down logical chains explaining the causes and even offer a solution at the end (which they can't apply for lack of commit access, for instance).

The proposed solution isn't always right, of course, but it would be incorrect to say that no bug reports come with a diagnosed cause. But that's exactly where a conscientious reviewer is most needed, I believe.

I sometimes write detailed bug reports but not a PR when there are different ways to address the problem (and all look bad to me) or the fix can introduce new problems. But I would expect an LLM to ignore tradeoffs and choose an option which is not necessarily the best, for the same reason I hesitate: lack of understanding of this specific project.
It appears that they’re using the PRs from the top 5000 most popular PyPI packages for their bench: https://github.com/princeton-nlp/SWE-bench/tree/main/swebenc...
Maybe it would be better if the agent would help people submit better reports instead of trying to fix it. E.g. it could ask them to add missing information, test different combinations of inputs, etc. It could also learn which maintainer to ping according to the type of issue.
Maybe it just needs another, independent tool. One that detects poorly written bug reports and rejects them.

A cool thing about LLMs is they have infinite patience. They can go back and forth with the user until they either sort out how to make a usable bug report, or give up.

While it might tickle metrics the right way, frustrating a user into giving up because your bot was not satisfied is not solving their problem.
I was thinking in the context of an open source project, where the users are hopefully converting to productive community members. If it is, like, a job, with a customer service relationship, where they are paying to be able to just throw problems at you and you have to deal with fixing them, I’m sure this wouldn’t fly, so I agree there. (I think my brain short-circuited to open source because it is on GitHub, haha, but of course there’s no reason this couldn’t be used in a proprietary setting).

I’m not sure how it would work out in the case of a free, community driven project, though. The goal isn’t to serve users, it is to convert users into helpful community members. If the bot converts people who wouldn’t otherwise be converted, it seems like a win. If it chases away users who could have been converted with human intervention, that’s a loss. But the human community members can always jump into the thread as well… if the bot is filtering out lots of people and nobody from the community is intervening, I guess that tells us something about the priorities of the community, haha.

I think that depends on the exact KPI.
KPI stands for key performance indicator. It is a tool to grade people or teams by applying numbers to their work.

The only relationship you can have between these is that a ticket with a "resolved" status can be used as a KPI, but you're trying to invert the relationship here, which doesn't work. After all, it's an indicator and not a causal relationship.

"ratio of open/total issues" can definitely be gamed by autoclosing anything that isn't an easy fix.

"average time to resolution" is also susceptible.

Both of these are pretty common all over the place, including OSS e.g. https://isitmaintained.com/#metrics
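
To make the gaming concrete, here is a rough sketch of how a stale bot moves both numbers without fixing anything. The `Issue` class, the `autoclose_stale` helper, and all the day counts are made up for illustration; this is not how isitmaintained.com or any real bot computes its metrics.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Issue:
    opened_day: int
    closed_day: Optional[int] = None  # None means the issue is still open


def open_ratio(issues: List[Issue]) -> float:
    """Fraction of all issues that are still open."""
    return sum(i.closed_day is None for i in issues) / len(issues)


def avg_days_to_resolution(issues: List[Issue]) -> float:
    """Mean open-to-close time, computed over closed issues only."""
    durations = [i.closed_day - i.opened_day
                 for i in issues if i.closed_day is not None]
    return sum(durations) / len(durations)


def autoclose_stale(issues: List[Issue], today: int, max_age_days: int) -> List[Issue]:
    """Hypothetical stale bot: close every issue open longer than max_age_days."""
    result = []
    for i in issues:
        if i.closed_day is None and today - i.opened_day > max_age_days:
            # Closed by the bot at the staleness deadline, not by a fix.
            result.append(Issue(i.opened_day, i.opened_day + max_age_days))
        else:
            result.append(i)
    return result


# Made-up tracker: two easy fixes, one slow fix, three hard bugs still open.
before = [Issue(0, 2), Issue(0, 2), Issue(0, 120), Issue(0), Issue(0), Issue(0)]
after = autoclose_stale(before, today=200, max_age_days=14)

print(open_ratio(before), open_ratio(after))            # 0.5 0.0
print(round(avg_days_to_resolution(before), 1),
      round(avg_days_to_resolution(after), 1))          # 41.3 27.7
```

Both metrics look better afterwards, even though the three hard bugs are exactly as unfixed as before.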

I suspect this sort of thing is one of the major motivations for the (as a user/reporter) infuriating rise in automated "this bug hasn't been touched in NN days, autoclosing for staleness" bots on various issue trackers.

This whole “worrying about KPIs for my free, open source, community project” thing seems weird to me. (Not to say I don’t believe you, but I don’t understand why people want to inject this annoying mini-game into their hobby).

I’m not sure what to think about the auto-close bots. Which do you think would be more annoying as the person who made the report: having a report that just sits there forever and you just have to hope somebody decided to pick it up, or having the issue auto-closed? (I’m truly and honestly not sure). At least in the case of the former you have a clear marker for when you should try again. But getting rejected by a bot can definitely be annoying.

Oh, so you've experienced those "stale" bots on GitHub. Good times.
I agree that bugs aren't as well specified as the example. But a specification for a new feature certainly can be.

I'm going to give it a try on my side project and see if it can at least provide a hint or some guidance on the development of small new features in an existing well structured project.

Agreed. I have never encountered a simple math bug in the wild.

To a non-programmer, putting in tests for myfunc(x) {return x + 2;} sounds useful but in reality computers do not tend to have any issues performing basic algebra.

Exactly. This is not perfect and doesn't fix every report so it is useless.
On the contrary, it’s worse than useless. If it could fix 12% of bugs (it can’t — it only fixes 12% of their benchmark suite), you’d still have to figure out which 12% of the responses it gave were good. So, 88% of the time you’d have wasted time confirming a “fix” that doesn’t work. But it’s worse than that. Because even on the fixes it got right, you’d still have to fully vet it, because this tool doesn’t know when it can’t solve something, or ask for clarification. It just gives a wrong answer.

So you didn’t save 12% of your effort, you wasted probably more than double your effort checking the work of a tool that is wrong eight out of nine times.
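
A back-of-the-envelope version of that argument, as a toy model rather than anything from the paper: assume fixing a bug by hand costs 1 unit, every generated patch has to be vetted at a cost of `review_cost` units, and every rejected patch still needs the full manual fix. The function names and the sample review costs are my own assumptions.

```python
def relative_effort(fix_rate: float, review_cost: float) -> float:
    """Effort with the tool, relative to fixing everything by hand (= 1.0).

    Every patch is reviewed (review_cost each), and the share of bugs the
    tool did not fix (1 - fix_rate) still costs a full manual fix.
    """
    return review_cost + (1.0 - fix_rate)


def tool_saves_effort(fix_rate: float, review_cost: float) -> bool:
    """True when the toy model says the tool costs less than doing it all by hand."""
    return relative_effort(fix_rate, review_cost) < 1.0


# With a 12% fix rate, try a few hypothetical vetting costs.
for review_cost in (0.05, 0.5):
    print(review_cost,
          round(relative_effort(0.12, review_cost), 2),
          tool_saves_effort(0.12, review_cost))
# 0.05 0.93 True
# 0.5 1.38 False
```

Under these assumptions the tool only comes out ahead when vetting a patch costs less than 12% of fixing the bug yourself; how much worse it gets beyond that depends entirely on how expensive the vetting really is.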

That’s not what I said and you know it. I’m not saying LLMs are useless. I’m not even saying this tool is useless. I’m saying I’m not impressed with this tool, at least as represented in the demo.
If the bug report needs to be of a certain quality to work, they've just invented issue-oriented programming.
The trick is that people would use LLM to write very long and detailed bug reports :p