HN Mirror

by hedora 874 days ago

The study shows a doubling in the rate at which tech debt code is produced and checked into the repo.

Anecdotally, as a principal engineer, I’ve definitely noticed that new senior engineers on the team who say they are using ChatGPT/Copilot produce unprecedentedly bad code at unprecedented rates.

It takes me 2-3x longer to unwind such crap than it would for me to write it from scratch.

As we grow the team, this will definitely put us out of business unless we find a way to fix it.

Currently, we’re hoping the AI-assisted engineers will get better at unborking code before merging it, but that’s a harder task than RTFM or going to Stack Overflow to copy-paste.

2 comments

jerf 874 days ago

I don't have great English words for this, but the biggest concern for me with LLMs is that of all the text generation algorithms I've ever seen, they are just fantastic at producing output whose plausibility to the human mind greatly exceeds its actual quality, the difficulty of concretely measuring either of those values notwithstanding.

Note I'm not even strictly speaking criticizing the quality of the output per se. It is also a big jump over any previous technology and very impressive in its own way.

It is, nevertheless, quite dangerous because the jump in the human-perceived plausibility is much larger than the quality improvement.

Whereas earlier techs were obviously wrong to a human reader, in the case of code generation so obviously wrong that we never even considered using them, LLMs are extremely good at hiding the errors in the parts of the code that we are cognitively most inclined to overlook. This also has the effect of making it bizarrely difficult code to fix.

How it does this I do not know. A fascinating research question for some ambitious cognitive scientist. But the signal is very strong and I don't need to wait for a paper to come out to see it.

I do not think this is fundamental to AI. As I like to remind people, LLMs are not the whole of AI. They're just one technique, and, partially for the very reason I discuss in this post, one I expect to eventually become a part of a larger system that can fix this problem at some higher level. I expect people to someday look back and laugh at us for thinking that LLMs could be used for all the things we think they can be used for. But the reason they will be laughing is the very experience we're gathering now, and there's no skipping that phase.

qrian 866 days ago

I have heard 'bullshitting' as a term for this. Being underhandedly deceptive, that is. But I do have to say a lot of humans also bullshit through their work and get away with it, so I don't know if this problem is fixable.

tomrod 866 days ago

https://en.wikipedia.org/wiki/Chinese_room

rabuse 874 days ago

I was going to ask, why aren't they running the code through unit tests if they're committing such shit code? Does it pass the tests, or is it just inefficient code?

dathinab 874 days ago

First consider the following: the problem with current LLMs is that they generate code which, in addition to any obvious messiness, has a tendency to look plausible even when it is wrong. I.e. the kinds of bugs which are most often overlooked in reviews are also the kinds of bugs LLMs are most likely to have "accidentally learned as correct behaviour".

Then: outside of some extra complex or absurdly simple cases, it is very often harder to write tests which truly and fully test your code than it is to write the code correctly.

In my experience, correct code is often a product of carefully written tests (which in reality are still imperfect), static code analysis (this can be the type system, or external tools) applied to carefully written code, and a proper code review.

So if you bring both together, you have:

- AI supported code which is likely to contain bugs which are really easy to overlook in reviews

- AI supported test code which has the same issue, i.e. it has gaps which reviewers are really likely to overlook

- more code, due to less reuse and because it is sometimes easier to generate code than to use a library, leading to more code review needing to be done and, in turn, more time pressure and lower quality reviews

So, put together: more bugs which are hard to find, tests which are more likely to subtly pass even with bugs, and less time for proper reviews.

So does it pass the test? Yes, but the test was AI-written too, so can it be trusted?
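
To make that concrete, here is a small made-up sketch of the pattern (not code from anyone's actual repo; the `chunks` helper and its test are hypothetical). The function looks plausible, the test looks reasonably thorough, and the test passes, yet both share the same blind spot:

```python
def chunks(items, size):
    """Split items into consecutive lists of length `size` (hypothetical helper)."""
    # Subtle bug: integer division drops a trailing partial chunk.
    return [items[i * size:(i + 1) * size] for i in range(len(items) // size)]


def test_chunks():
    # Looks thorough, but every input length is a multiple of `size`,
    # so the partial-chunk case is never exercised.
    assert chunks([1, 2, 3, 4], 2) == [[1, 2], [3, 4]]
    assert chunks([], 3) == []
    assert chunks(["a", "b", "c"], 1) == [["a"], ["b"], ["c"]]


test_chunks()  # passes without complaint

# An input the test never tries exposes the bug: the 5 is silently lost.
print(chunks([1, 2, 3, 4, 5], 2))  # [[1, 2], [3, 4]]
```

A reviewer skimming this sees a green test run and three sensible-looking assertions, which is exactly why the gap is easy to miss.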

actionfromafar 874 days ago

tests?