by anotherpaulg
814 days ago
Very cool project! I've experimented in this direction previously, but found that agentic behavior is often chaotic and leads to long, expensive sessions that go down the wrong rabbit hole and ultimately fail. It's great that you succeed on 12% of swe-bench, but what happens the other 88% of the time? Is it useless wasted work and token costs? Or does it make useful progress that can be salvaged? Also, I think swe-bench is from your group, right? Have you made any attempt to characterize a "skilled human upper bound" score? I randomly sampled a dozen swe-bench tasks myself, and found that many were basically impossible for a skilled human to "solve", mainly because the tasks were underspecified with respect to the hidden test cases that determine passing. The tests were checking implementation-specific details from the repo's PR that weren't actually stated requirements of the task.