HN Mirror


by arp242 807 days ago

Here's a list of all the successful and unsuccessful patches: https://gist.github.com/arp242/0dc5dab0f7cd10e663cfc26866651...

Ideally, it should also include the problem statement, but that's not in their JSON file and I can't be arsed to continue working on it – it's just a quick script I cooked up.

I find it very hard to judge the quality of most of these patches because I'm not familiar with these projects.

However, looking at the SWE-bench dataset I don't think it's representative of real-world issues, so "22% of real-world GitHub issues" is not really accurate regardless.


yuntong 807 days ago

The problem statement of each issue is included in each result folder as `problem_statement.txt` (such as: https://github.com/nus-apr/auto-code-rover/blob/main/results...).

The developer patch for each issue is similarly included as `developer_patch.diff`.


wsdookadr 807 days ago

What makes you say it's not representative?


skywhopper 807 days ago

SWE-bench Lite is a subset of extremely simple issues from a cherry-picked subset (SWE-bench) of a handful of large (presumably well-run) Python-only projects.

Here are some rules they used to trim down the SWE-bench Lite problems:

* We remove instances with images, external hyperlinks, references to specific commit shas and references to other pull requests or issues.

* We remove instances that have fewer than 40 words in the problem statement.

* We remove instances that edit more than 1 file.

* We remove instances where the gold patch has more than 3 edit hunks (see patch).

See https://www.swebench.com/lite.html
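As a rough sketch of what those four rules amount to, here is a filter in Python. This is not the SWE-bench authors' code: the function names, the regexes for spotting images, links, commit SHAs and issue references, and the choice to count `diff --git` and `@@` lines are all my own assumptions, and the heuristics are deliberately crude.

```python
import re

# All patterns below are illustrative heuristics, not the official filters.
URL_RE = re.compile(r"https?://", re.IGNORECASE)
IMAGE_RE = re.compile(r"!\[[^\]]*\]\([^)]*\)|<img\b", re.IGNORECASE)
# 7-40 hex characters containing at least one digit and one letter.
SHA_RE = re.compile(r"\b(?=[0-9a-f]*\d)(?=[0-9a-f]*[a-f])[0-9a-f]{7,40}\b")
# "#1234"-style references to other issues or pull requests.
ISSUE_REF_RE = re.compile(r"(?<!\w)#\d+\b")


def count_files(patch: str) -> int:
    """Count files touched by a git-style unified diff."""
    return sum(1 for line in patch.splitlines() if line.startswith("diff --git "))


def count_hunks(patch: str) -> int:
    """Count edit hunks in a unified diff."""
    return sum(1 for line in patch.splitlines() if line.startswith("@@ "))


def keep_instance(problem_statement: str, patch: str) -> bool:
    """Return True if an instance survives the four rules quoted above."""
    text = problem_statement
    if IMAGE_RE.search(text) or URL_RE.search(text):
        return False  # images or external hyperlinks
    if SHA_RE.search(text) or ISSUE_REF_RE.search(text):
        return False  # commit SHAs or references to other PRs/issues
    if len(text.split()) < 40:
        return False  # fewer than 40 words
    if count_files(patch) > 1:
        return False  # edits more than one file
    if count_hunks(patch) > 3:
        return False  # gold patch has more than three hunks
    return True
```

Even with heuristics this loose, every rule cuts away a kind of issue that turns up all the time in real trackers: screenshots, links to related tickets, multi-file fixes.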


kevindamm 806 days ago

That's... rather limiting.


arp242 807 days ago

Look at the data. Does that seem like the average bug report to you?


falcor84 807 days ago

It would help if you were to provide a specific example or two.


arp242 807 days ago

You can't demonstrate whether a dataset is representative or not by "an example or two". You need to look at all the data.

And all of this is fine. It's just a benchmark suite and doesn't need to be fully representative. The dataset itself doesn't even claim to be that as far as I can find. All I'm saying is that the title wasn't really accurate.
