by nyrikki 818 days ago

To add to this.

I was going through Devin's 'pass' diffs from SWE-bench.

Every one that I traced back to the actual issue made changes that would reduce maintainability or introduce potential side effects.

I think it may be useful as a source of suggestions in a red-green-refactor model, but on its own it will end up producing code that is hard to maintain and modify.

Note this one here, which introduced circular dependencies and changed a function that only accepted points into one that appears to accept any geometric object but only added handling for lines.
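To make that pattern concrete, here is a minimal sketch. It is not the actual Devin diff: every name in it (`Point`, `Line`, `Circle`, `distance_before`, `distance_after`) is hypothetical, and it only illustrates how a point-only function can be widened so that it looks generic while really handling just one extra type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    x: float
    y: float


@dataclass(frozen=True)
class Line:
    start: Point
    end: Point


@dataclass(frozen=True)
class Circle:
    center: Point
    radius: float


def distance_before(a: Point, b: Point) -> float:
    """Hypothetical original: the contract is explicit, points only."""
    return ((a.x - b.x) ** 2 + (a.y - b.y) ** 2) ** 0.5


def distance_after(a, b) -> float:
    """Hypothetical widened version: the signature no longer says 'points',
    but only Line received a new branch."""
    if isinstance(b, Line):
        # Illustrative shortcut: measure to the line's start point only.
        b = b.start
    # Any other geometry (e.g. Circle) falls through and fails on b.x.
    return ((a.x - b.x) ** 2 + (a.y - b.y) ** 2) ** 0.5
```

Under these assumptions, a call with a `Circle` passes any loose type check and only fails at runtime with an `AttributeError`, which is the kind of side effect a "passing" test suite would not necessarily catch.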

Domain knowledge and writing maintainable code are beyond generative transformers.

https://github.com/CognitionAI/devin-swebench-results/blob/m...

You simply can't get past what Gödel and Rice proved with current technology.

It is like when visual languages were supposed to replace programmers. Code isn't really the issue, the details are.

2 comments

ekidd 818 days ago

Thank you for reading the diffs and reporting on them.

And to be fair, lots of humans are already at least this bad at writing code. And lots of companies are happy with garbage code so long as it addresses an immediate business requirement.

So Devin wouldn't have to advance much to be competitive in certain simple situations where people don't care about anything that happens more than 2 quarters into the future.

I also agree that producing good code which meets real business needs is a hard problem. In fact, any AI which can truly do the work of a good senior software engineer can probably learn to do a lot of other human jobs as well.

nyrikki 818 days ago

Architectural erosion is an ongoing problem for humans, but at the SWE level they don't produce tightly coupled, low-cohesion code by default the majority of the time.

With changes of this quality, it won't be long until violations stack up to the point where further changes are beyond any algorithm's ability to unravel.

While lots of companies do only look out in the short term, human programmers are incentivized to protect themselves from pain if they aren't forced into unrealistic delivery times.

AT&T Wireless being destroyed as a company by a failed SAP migration, which was largely due to fragile code, is a good example.

But I guess if the developer jobs that will go away are from companies that want to underperform in the market due to errors and a code base that can't adapt to changing market realities, that may happen.

But I would fire any non-intern programmer who constantly did things like removing deprecation comments and introducing circular dependencies in the majority of their commits.

https://github.com/CognitionAI/devin-swebench-results/blob/m...

PAC learning is powerful but is still probably approximately correct.

Until these tools can avoid the most basic bad practices I don't see any company sticking to them in the long term, but it will probably be a very expensive experiment for many of them.

falcor84 818 days ago

Can't we just RLHF code reviews?

nyrikki 818 days ago

RLHF works on problems that are difficult to specify yet easy to judge.

While RLHF will help improve systems, code correctness is not easy to judge outside of the simplest cases.

Note how, in OpenAI's GPT-4 technical report, they admit that performance on college-level tests comes almost exclusively from pre-training. If you look at the LSAT as an example, all those questions were probably in the corpus.

https://arxiv.org/abs/2303.08774

falcor84 818 days ago

>RLHF works on problems that are difficult to specify yet easy to judge.

But that's the thing: it seems that everyone here on HN (and elsewhere) finds it easy to judge the flaws of AI-generated code, and their judgments seem relatively consistent. So if we start offering these critiques as RLHF at scale, we should be able to bring the LLM output to the level where further feedback is hard (or at least inconsistent), right?

ogogmad 818 days ago

> You simply can't get past what Gödel and Rice proved with current technology.

Not this again. Those theorems tell you nothing about your concerns. The worst case of a problem is not equal to its usual case.
