| HN Mirror

by wokwokwok 572 days ago

This is super big news if it’s real.

Basically, given an agent with an initial set of predefined actions and a goal, they’re saying “decompose this into steps and pick an action to achieve each step”. Pretty standard stuff.

Then they say, hey, if you can’t solve the problem with those actions (i.e. you failed repeatedly when attempting to solve it), write some arbitrary generic Python code and use that as your action for the next step.

Then save that as a new generic action, and slowly build up a library of actions to augment the initial set.
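For concreteness, the loop described above might look roughly like the sketch below. This is not the paper's implementation: `run_task`, `try_known_actions`, `compile_action`, and the `fake_write_action` stand-in for the code-writing model are all hypothetical names, and "failure" is modelled simply as an action raising an exception.

```python
from typing import Callable, Dict, List, Optional, Tuple

Action = Callable[[str], str]


def compile_action(source: str) -> Action:
    """Turn generated source that defines `run(step)` into a callable.
    Plain exec is for illustration only; a real system would sandbox this."""
    namespace: dict = {}
    exec(source, namespace)
    return namespace["run"]


def try_known_actions(step: str, library: Dict[str, Action],
                      max_attempts: int) -> Optional[str]:
    """Try at most max_attempts saved actions; an exception counts as a failure."""
    for action in list(library.values())[:max_attempts]:
        try:
            return action(step)
        except Exception:
            continue
    return None


def run_task(steps: List[str], library: Dict[str, Action],
             write_action: Callable[[str], Tuple[str, str]],
             max_attempts: int = 3) -> List[str]:
    results = []
    for step in steps:
        result = try_known_actions(step, library, max_attempts)
        if result is None:
            # Known actions failed: ask the (hypothetical) model for new code.
            name, source = write_action(step)
            action = compile_action(source)
            result = action(step)      # generated code may itself still raise
            library[name] = action     # saved only after it has worked once
        results.append(result)
    return results


# Stand-ins so the sketch runs without a model.
def upper(step: str) -> str:
    if not step.startswith("upper:"):
        raise ValueError("unsupported step")
    return step[len("upper:"):].upper()


def fake_write_action(step: str) -> Tuple[str, str]:
    source = (
        "def run(step):\n"
        "    if not step.startswith('reverse:'):\n"
        "        raise ValueError('unsupported step')\n"
        "    return step[len('reverse:'):][::-1]\n"
    )
    return "reverse", source


library: Dict[str, Action] = {"upper": upper}
results = run_task(["upper:abc", "reverse:abc", "reverse:xyz"],
                   library, fake_write_action)
```

In this toy run the third step reuses the `reverse` action generated for the second step instead of generating code again, which is the library build-up being described.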

The thing is, there’s no meaningful difference between the task “write code to solve this task” and “write code to solve this action”; if you can meaningfully generate code that can, without error, perform arbitrary tasks, you’ve basically solved programming.

So… that would be quite a big deal.

That would be a real “Devin” that would actually be able to write arbitrary code to solve arbitrary problems.

…which makes me a bit sceptical.

Still, this seems to have at least worked reasonably well (as shown by being a leader on the GAIA leaderboard) so they seem to have done something that works, but I’m left wondering…

If you’ve figured out how to get an agent to write error free deterministic code to perform arbitrary actions in a chain of thought process, why are you pissing around with worrying about accumulating a library of agent actions?

That’s all entirely irrelevant and unnecessary.

Just generate code for each step.

So… something seems a bit strange around this.

I’d love to see a log of the actual problem / action / code sequences.

2 comments

Kiro 572 days ago

Devin is real. What do you mean?

Anyway, this is pretty standard stuff already. In all my agent workflows the agents are able to write their own code and execute it before passing the result to the next agent. It doesn't need to be perfect since you always have an agent validating the results, sending the task back if necessary.
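A minimal sketch of that write, execute, validate loop, assuming a writer that returns source defining `solve()` and a validator that returns a pass/fail flag plus feedback. The names `solve_with_validation`, `fake_writer`, and `fake_validator` are hypothetical stand-ins, not any particular framework's API.

```python
from typing import Callable, List, Optional, Tuple

Writer = Callable[[str, Optional[str]], str]
Validator = Callable[[str, object], Tuple[bool, str]]


def solve_with_validation(task: str, writer: Writer, validator: Validator,
                          max_rounds: int = 3) -> object:
    """Generate code, run it, and send the task back with feedback on failure."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        source = writer(task, feedback)
        namespace: dict = {}
        try:
            exec(source, namespace)        # illustration only; sandbox in practice
            result = namespace["solve"]()
        except Exception as exc:
            feedback = f"execution failed: {exc!r}"
            continue
        ok, feedback = validator(task, result)
        if ok:
            return result
    raise RuntimeError(f"no validated result after {max_rounds} rounds: {feedback}")


attempts: List[Optional[str]] = []


def fake_writer(task: str, feedback: Optional[str]) -> str:
    """Stand-in for the code-writing agent: wrong first, fixed after feedback."""
    attempts.append(feedback)
    if feedback is None:
        return "def solve():\n    return sum(range(1, 3))\n"  # off by one on purpose
    return "def solve():\n    return sum(range(1, 4))\n"


def fake_validator(task: str, result: object) -> Tuple[bool, str]:
    """Stand-in for the validating agent; here it just knows the expected answer."""
    expected = 6
    if result == expected:
        return True, "ok"
    return False, f"expected {expected}, got {result}"


answer = solve_with_validation("sum the integers 1..3", fake_writer, fake_validator)
```

The generated code does not have to be right the first time here, but the loop is only as good as the validator, which in this toy case already knows the answer.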

I haven't read the paper beyond the synopsis, so I might be missing a key takeaway, and I presume it has a lot of additional layers.

wokwokwok 572 days ago

As evidenced by the reaction to Devin, no, it’s not real.

There’s a limit, beyond which agent generated code is, in general, not reliable.

All of the people who claim otherwise (like the Devin videos) have been shown to be fake [1] or cherry-picked.

Having agents generate arbitrary code to solve arbitrary problems is. Not. A. Solved. Problem.

Yet.

…no matter how many AI bros claim otherwise, currently.

Being able to decompose complex problems into parts small enough to be solved by current models would be a big deal if it was real.

(Because, currently the SoTA can’t reliably do this; this should not be a remotely controversial claim to people familiar with this space)

So, tl;dr: extraordinary claims require extraordinary evidence, which is absent here, as far as I can tell. They specifically call out in the paper that generated actions are overly specific and don’t always work; but as I said, it’s doing well on the leaderboard, so it’s clearly doing something that works, but there’s just noooooo way of seeing what.

[1] - https://www.zeniteq.com/blog/devins-demo-as-the-first-ai-sof...

IanCal 572 days ago

> If you’ve figured out how to get an agent to write error free deterministic code to perform arbitrary actions in a chain of thought process

You don't need it to be perfect, and the more you reuse things that you know work, the less you have to build each time (fewer places for errors).
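A back-of-the-envelope illustration of that point, under strong assumptions that are not from the paper: each freshly generated step works independently with probability 0.9, and reused, already-vetted actions always work. The function name and the numbers are made up for the example.

```python
def chain_success(p_new: float, new_steps: int) -> float:
    """Probability that every freshly generated step in a chain works,
    assuming independent failures and perfectly reliable reused steps."""
    return p_new ** new_steps


# A 10-step task where every step is generated from scratch.
all_generated = chain_success(0.9, 10)   # roughly 0.35

# The same task where 8 steps reuse known-good actions and 2 are new.
mostly_reused = chain_success(0.9, 2)    # roughly 0.81

print(f"all generated: {all_generated:.2f}, mostly reused: {mostly_reused:.2f}")
```

Real failures are neither independent nor confined to new code, so this only shows the direction of the effect, not its size.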

> Just generate code for each step.

We don't do this as humans; we build and reuse pieces.
