by dools 604 days ago
> But the idea of letting an LLM write/move large swaths of code seems so incredibly irresponsible

I heard a similar thing from a dude when I said I use it for bash scripts instead of copying and pasting things off StackOverflow.

He was a bit "get off my lawny" about the idea of running any code you didn't write, especially bash scripts in a terminal.

Obviously I didn't write most of the code in the world, by a very large margin. But even without taking it to extremes, if I'm working on a team and people are writing code how is it any different? Everyone makes mistakes, I make mistakes.

I think it's a bad idea to run things when you don't at least understand what they're going to do, but ChatGPT is lightning fast at producing, for example, gcloud shell commands to manage resources (all of which are very readable; it just takes a while if you want to look them up and compose the commands yourself).

If your quality control method is "making sure there are no mistakes" then it's already broken regardless of where the code comes from. Me reviewing AI code is no different from me reviewing anyone else's code.

Me testing AI code using unit or integration tests is no different from testing anyone else's code, or my own code for that matter.
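To make that concrete, here's a minimal sketch of what I mean. The `slugify` helper is hypothetical and made up for illustration; the point is only that the tests don't care whether it came from an LLM, a colleague, or me.

```python
def slugify(title: str) -> str:
    """Hypothetical helper: imagine it was pasted from an LLM, a colleague, or hand-written."""
    # Replace every non-alphanumeric character with a space, then join the words.
    words = "".join(c.lower() if c.isalnum() else " " for c in title).split()
    return "-".join(words)


# The tests only look at behaviour, never at who (or what) wrote the code.
def test_basic():
    assert slugify("Hello, World!") == "hello-world"


def test_collapses_whitespace():
    assert slugify("  a   b  ") == "a-b"


def test_empty():
    assert slugify("") == ""


if __name__ == "__main__":
    for test in (test_basic, test_collapses_whitespace, test_empty):
        test()
    print("all tests passed")
```

The same file works under pytest as-is, since the test functions follow its naming convention.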

4 comments

> Me reviewing AI code is no different from me reviewing anyone else's code.

I take your point, and on the whole I agree with your post, but this point is fundamentally _not_ correct, in that if I have a question about someone else's code I can ask them about their intention, state-of-mind, and understanding at the time they wrote it, and (subjectively, sure; but I think this is a reasonable claim) can _usually_ detect pretty well if they are bullshitting me when they respond. Asking AI for explanations tends to lead to extremely convincing and confident false justifications rather than an admission of error or doubt.

However:

> Me testing AI code using unit or integration tests is no different from testing anyone else's code, or my own code for that matter.

This is totally fair

> Asking AI for explanations tends to lead to extremely convincing and confident false justifications rather than an admission of error or doubt.

Not always true: AI can realise its own mistakes and it can learn. It's a feedback loop, though as it stands the feedback about what is good and bad is provided by end users and fed back into e.g. Copilot.

That loop is not a short one though. An LLM doesn't actively incorporate new information into its model while you're chatting with it. That goes into its context window/short-term memory. That the inputs and outputs can be used when training the next model, or for fine-tuning the current one, doesn't change the fact that training and inference are distinct steps.
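A toy sketch of that separation, in case it helps. `FrozenModel` and `ChatSession` are names I made up for illustration and don't correspond to any real library; a real LLM is obviously far more involved. The only thing this is meant to show is that chatting grows the context, not the weights.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FrozenModel:
    """Stand-in for trained weights: set at training time, read-only at inference."""
    weights: tuple[float, ...]

    def generate(self, context: list[str]) -> str:
        # A real LLM would condition on the tokens in `context`; this toy
        # only reports how many messages it was given.
        return f"reply conditioned on {len(context)} context messages"


@dataclass
class ChatSession:
    """One conversation: the only thing that changes here is the context."""
    model: FrozenModel
    context: list[str] = field(default_factory=list)

    def send(self, user_message: str) -> str:
        self.context.append(user_message)
        reply = self.model.generate(self.context)
        self.context.append(reply)
        return reply


model = FrozenModel(weights=(0.1, 0.2, 0.3))

chat = ChatSession(model)
chat.send("That answer was wrong, please fix it.")
chat.send("Thanks, remember that next time.")

# A new conversation starts with an empty context and the very same weights:
# nothing "learned" in the first chat carries over.
new_chat = ChatSession(model)
```

Anything said in `chat` only reaches the weights if someone later uses the logs in a separate training or fine-tuning run.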

> AI can realise

wait did AGI happen? which AI is this?

stop anthropomorphizing them

No they can't. They can generate text that indicates they hallucinated, you can tell them to stop, and they won't.

They can generate text that appears to admit they are incapable of doing a certain task, and you can ask them to do it again, and they will happily try and fail again.

Sorry but give us some examples of an AI "realizing" its own mistakes, learning, and then not making the mistake again.

Also, if this were even remotely possible (which it is not), then we should be able to just get AIs with all of the mistakes pre-made, so that they have already learned and won't make them again, right? So the AI has already "realized" and "learned" which tasks it's incapable of, so it will actually refuse or find a different way.

Or is there something special about the way that _you_ show the AI its mistakes, that is somehow more capable of making it "learn" from those mistakes than actually training it?

I'm assuming by bullshitting you mean differentiating between LLM hallucinations and a human with low confidence in their code.

I've found that LLMs do sometimes acknowledge hallucinations. But really the check is much easier than a PR/questioning an author - just run the code given by the copilot and check that it works, just as if you typed it yourself.

> just run the code given by the copilot and check that it works

You've misunderstood my point. I'm not discussing the ability to check whether the code works as _I_ believe it should (as you say, that's easy to verify directly, by execution and/or testing); I'm referring to asking about intention or motivation of design choices by an author. Why this data structure rather than that one? Is this unusual or unidiomatic construction necessary in order to work around a quirk of the problem domain, or simply because the author had a brainfart or didn't know about the usual style? Are we introducing a queue here to allow for easy retries, or to decouple scaling of producers and consumers, or...? I can't evaluate the correctness of a choice without either knowing the motivation for it, or by learning the problem domain well enough to identify and make the choice myself - at which point the convenience of the AI solution is abnegated because I may as well have written it myself.

(ref: "Code only says what it does" - https://brooker.co.za/blog/2020/06/23/code.html)
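A tiny illustration of the queue example, as a sketch only (the names `jobs`, `submit`, and `drain` are invented for this comment). Reading the code tells you *that* there is a bounded queue, and nothing about *why*.

```python
import queue

# Why is this a queue? To allow easy retries? To decouple producer and
# consumer scaling? To smooth out bursts? The code below is identical in
# all three cases; only the author (or a comment like this one) can say.
jobs: "queue.Queue[str]" = queue.Queue(maxsize=100)


def submit(job: str) -> None:
    """Producer side: enqueue a job (blocks if the queue is full)."""
    jobs.put(job)


def drain() -> list[str]:
    """Consumer side: take everything currently queued, in FIFO order."""
    done = []
    while not jobs.empty():
        done.append(jobs.get())
    return done


submit("resize-image-1")
submit("resize-image-2")
processed = drain()
```

Whether `maxsize=100` is a deliberate backpressure limit or an arbitrary number is exactly the kind of thing I'd want to ask the author.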

And, yes, you can ask an LLM to clarify or explain its choices, but, like I said, the core problem is that they will confidently and convincingly lie to you. I'm not claiming that humans never lie - but a) I think (I hope!) they do it less often than LLMs do, and b) I believe (subjectively) that it tends to be easier to identify when a human is unsure of themself than when an LLM is.

> I can't evaluate the correctness of a choice without either knowing the motivation for it, or by learning the problem domain well enough to identify and make the choice myself - at which point the convenience of the AI solution is abnegated because I may as well have written it myself.

I think I usually accept code that falls into the latter category: the convenience is that I don't need to spend any real energy implementing the solution or thinking too deeply about it. Sometimes the LLM will produce a more interesting approach that I did not consider initially but is actually nicer than what I wanted to do (afaik). Often it does what I want or something similar enough to what I would've written, except that it does it instantly instead of me manually typing, doc searching, adding types, and correcting the code. If it does something weird that I don't agree with, I instead modify the prompt to align closer to the solution I had in mind. Much like Google, sometimes the first query does not do the trick and a query reformulation is required.

I wouldn't trust an LLM to write large chunks of code that I wouldn't have been able to write/figure out myself - it's more of a coding accelerant than an autonomous engineer for me (maybe that's where our PoVs diverged initially).

I suspect the similarity with PRs is that when I'm assigned a PR, I generally have enough knowledge about the proposed modification to have an opinion on how it should be done and the benefits/drawbacks of each implementation. The divergence from a PR is that I can ask the LLM for a modification of approach in just a few seconds and continue to ask for changes until I'm satisfied (so it doesn't matter if the LLM chose an approach I don't understand - I can just ask it to align with the approach I believe is optimal).

Multiple times in my s/w development career, I've had supervisors ask me why I am not typing code throughout the work day.

My response each time was along the lines of:

  When I write code, it is to reify the part of a solution which
  I understand.  This includes writing tests to certify same.

  There is no reason to do so before then.

> He was a bit "get off my lawny" about the idea of running any code you didn't write, especially bash scripts in a terminal.

I hacked together a CLI tool that gives an LLM a CRUD interface to my local file system, letting it read, write, and execute code and tests, and feeds the commands' outputs back to it.

And it was bootstrapped with me playing the role of CLI tool.

Mostly useless, a bit irresponsible, but fun.

If that idea engages you, you might take a look at the openinterpreter GitHub.
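For the curious, the core of such a tool can be sketched roughly like this. This is not my actual code and not how openinterpreter works; the command format (`op`, `path`, `content`, `argv`) is an assumption invented for this sketch, and the model call itself is left out entirely.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def handle(command: dict, root: Path) -> str:
    """Apply one model-issued command inside `root`; return text to feed back to the model."""
    op = command.get("op")
    try:
        if op == "run":
            # Execute a program with `root` as the working directory.
            result = subprocess.run(
                command["argv"], cwd=root, capture_output=True, text=True, timeout=30
            )
            return f"exit={result.returncode}\n{result.stdout}{result.stderr}"

        # File operations: refuse paths that resolve outside the sandbox root.
        path = (root / command["path"]).resolve()
        if not path.is_relative_to(root.resolve()):
            return "error: path escapes the sandbox"
        if op == "read":
            return path.read_text()
        if op == "write":  # covers both create and update
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(command["content"])
            return f"wrote {len(command['content'])} characters"
        if op == "delete":
            path.unlink()
            return "deleted"
        return f"error: unknown op {op!r}"
    except (OSError, KeyError, subprocess.TimeoutExpired) as exc:
        # Errors go back to the model as text, just like normal output.
        return f"error: {exc!r}"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sandbox = Path(tmp)
        print(handle({"op": "write", "path": "hello.py", "content": "print(1 + 1)"}, sandbox))
        print(handle({"op": "run", "argv": [sys.executable, "hello.py"]}, sandbox))
```

The path check only limits the file operations; the `run` op can still execute anything, which is the "a bit irresponsible" part.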

> if I'm working on a team and people are writing code how is it any different? Everyone makes mistakes, I make mistakes.

because your colleagues know how to count

and they're not hallucinating while on the job

and if they try to slip an unrelated and subtle bug past you for the fifth time after asking them to do a very basic task, there are actual consequences instead of "we just need to check this colleague's code better"