HN Mirror

by anotherpaulg 814 days ago

Very cool project!

I've experimented in this direction previously, but found agentic behavior is often chaotic and leads to long expensive sessions that go down a wrong rabbit hole and ultimately fail.

It's great that you succeed on 12% of swe-bench, but what happens the other 88% of the time? Is it useless wasted work and token costs? Or does it make useful progress that can be salvaged?

Also, I think swe-bench is from your group, right? Have you made any attempt to characterize a "skilled human upper bound" score?

I randomly sampled a dozen swe-bench tasks myself, and found that many were basically impossible for a skilled human to “solve”, mainly because the tasks were underspecified with respect to the hidden test cases that determine passing. The tests were checking implementation-specific details from the repo’s PR that weren't actually stated requirements of the task.

2 comments

a_wild_dandan 814 days ago

Personally, I'd just use one of my local MacBook models (e.g. Mixtral 8x7b) and forget about any wasted branches & cents. My debugging time costs many orders of magnitude more than SWE-agent, so even a 5% backlog savings would be spectacular!

swatcoder 814 days ago

> My debugging time costs many orders of magnitude more than SWE-agent

Unless your job is primarily to clean up somebody else's mess, your debugging time is a key part of a career-long feedback loop that improves your craft. Be careful not to shrug it off as something less. Many, many people are spending a lot of money to let you forget it, and once you do, you'll be right there in the ranks of the cheaply replaceable.

(And on the odd chance that cleaning up other people's mess is your job, you should probably be the one doing it, and for largely the same reasons.)

nickpsecurity 814 days ago

I totally agree. My solution was to limit my AI use (a) to whatever didn't impair creativity and (b) more generally, to keep the brain sharp. If using AI regularly, one could just solve a percentage of the problems manually.

ein0p 814 days ago

I’ve tried this with another similar system. FOSS LLMs, including Mixtral, are currently too weak to handle something like this. For me they run out of steam after only a few turns and start going in circles.

Aperocky 814 days ago

That's assuming that the other 95% stays the same with this new agent (vs creating more work for you to now also have to parse what the model is saying).

int_19h 814 days ago

Given that they got 12% with GPT-4, which is vastly better than any open model, I doubt this would be particularly productive. And powering compute at full load is going to add up.

senko 814 days ago

If you don't mind me asking, which agentic tools/frameworks have you tried for code fixing/generation, with which LLMs?
