by irisgrunn 634 days ago
And this is the major problem. People will blindly trust the output of AI because it appears to be amazing, and this is how mistakes slip in. It might not be a big deal with the app you're working on, but in a banking app or medical equipment this can have a huge impact.

I feel like I’m being gaslit about these AI code tools. I’ve got the paid copilot through work and I’ve just about never had it do anything useful ever.

I’m working on a reasonably large Rails app and it can’t seem to answer any questions about anything, or even autofill the names of methods defined in the app. Instead it just makes up names that seem plausible. It’s literally worse than the built-in auto suggestions of VS Code, because at least those are confirmed to be real names from the code.

Maybe these tools work well on a blank project where you are building basic login forms or something. But certainly not on an established code base.

I'm in the same boat. I've tried a few of these tools and the output has generally ranged from terrible to useless, on tasks big and small. They've made up plausible-sounding but non-existent methods on the popular framework we use, something they should have plenty of context and examples for.

Dealing with the output is about the same as dealing with a code review for an extremely junior employee... who didn't even run and verify their code was functional before sending it for a code review.

Except here's the problem. Even for intermediate developers, I'm essentially always in a situation where the process of explaining the problem, providing feedback on a potential solution, answering questions, reviewing code and providing feedback, etc takes more time out of my day than it would for me to just _write the damn code myself_.

And it's much more difficult for me to explain the solution in English than in code--I basically already have the code in my head, now I'm going through a translation step to turn it into English.

All adding AI has done is take the part of my job that is "think about problem, come up with solution, type code in" and make it into something with way more steps, all of which are lossy as far as translating my original intent to working code.

I get we all have different experiences and all that, but as I said... same boat. From _my_ experiences this is so far from useful that hearing people rant and rave about the productivity gains makes me feel like an insane person. I can't even _fathom_ how this would be helpful. How can I not be seeing it?

The biggest lie in all of LLMs is that they’ll work out of the box and you don’t need to take time to learn them.

I find Copilot autocomplete invaluable as a productivity boost, but that’s because I’ve now spent over two years learning how to best use it!

“And it's much more difficult for me to explain the solution in English than in code--I basically already have the code in my head, now I'm going through a translation step to turn it into English.”

If that’s the case, don’t prompt them in English. Prompt them in code (or pseudo-code) and get them to turn that into code that’s more likely to be finished and working.

I do that all the time: many of my LLM prompts are the signature of a function or a half-written piece of code where I add “finish this” at the end.

Here’s an example, where I had started manually writing a bunch of code and suddenly realized that it was probably enough context for the LLM to finish the job… which it did: https://simonwillison.net/2024/Apr/8/files-to-prompt/#buildi...
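
To make the "prompt in code" idea above concrete, here is a minimal sketch of what such a prompt might look like. Everything in it is hypothetical and for illustration only: the helper `build_completion_prompt`, the half-written `count_tokens` function, and the exact wording of the trailing instruction are assumptions, not taken from the linked post. The sketch only assembles the prompt text; sending it to a model is left out.

```python
def build_completion_prompt(partial_code: str, instruction: str = "finish this") -> str:
    """Join half-written code and a short instruction into one prompt string."""
    # The code itself carries the intent; the instruction is just a trailing comment.
    return f"{partial_code.rstrip()}\n\n# {instruction}\n"


# Hypothetical half-written function: signature, docstring, and the first few lines.
partial = '''\
def count_tokens(paths: list[str]) -> dict[str, int]:
    """Return a mapping of file path -> whitespace-separated token count."""
    counts = {}
    for path in paths:
'''

prompt = build_completion_prompt(partial)
print(prompt)
```

The signature and docstring do most of the work here: they pin down names, types, and the return shape, which is the context an English description tends to lose.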

You bring up a good point! These tools are useless if you can't prompt them effectively.

I am decent at explaining what I want in English. I have coded and managed developers for long enough to include tips on how I want something implemented. So far, I am nothing short of amazed. The tools are nowhere near perfect, but they do provide a non-trivial boost in my productivity. I feel like I did when I first used an IDE.

> Except here's the problem. Even for intermediate developers, I'm essentially always in a situation where the process of explaining the problem, providing feedback on a potential solution, answering questions, reviewing code and providing feedback, etc takes more time out of my day than it would for me to just _write the damn code myself_.

Exactly. And I’ve been telling myself, "keep doing that, it lets them learn; otherwise they will never level up and be able to comfortably and reliably work on this codebase without much hand-holding. This will pay off." Which I still think is true to a degree, although less so with every year.

At least with the humans I work with it’s _possible_ and I can occasionally find some evidence that it _could_ be true to hang on to. I’m expending extra effort, but I’m helping another human being and _maybe_ eventually making my own life easier.

What’s the payoff for doing this with an LLM? Even if it can learn, why not let someone else do it and try again next year and see if it’s leveled up yet?

For me, AI is super helpful with one-off scripts, which I happen to write quite often when doing research. Just yesterday, I had to check that my assumptions about a certain aspect of our live system were true, and all I had was a large file which had to be parsed. I asked ChatGPT to write a script which parses the data and presents it in a certain way. I don't trust ChatGPT 100%, so I reviewed the script and checked it returned correct outputs on a subset of data. It's something which I'd do to the script anyway if I wrote it myself, but it saved me like 20 minutes of typing and debugging the code. I was in a hurry because we had an incident that had to be resolved as soon as possible. I haven't tried it on proper codebases (and I think it's just not possible at this moment) but for quick scripts which automate research in an ad hoc manner, it's been super useful for me.
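
The review step described above (run the generated script on a small subset whose correct answer is already known) could look roughly like the sketch below. The log format, the `parse_status_counts` function, and the sample lines are all assumptions made up for illustration; the original comment does not say what the file contained or how it was parsed.

```python
from collections import Counter


def parse_status_counts(lines):
    """Count occurrences of the second whitespace-separated field on each line."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:  # skip malformed or empty lines
            counts[fields[1]] += 1
    return counts


# Hypothetical subset of the large file, small enough to count by hand.
sample = [
    "2024-01-01 OK req=1",
    "2024-01-01 ERROR req=2",
    "2024-01-02 OK req=3",
    "malformed",
]
expected = {"OK": 2, "ERROR": 1}  # tallied manually from the sample above

result = parse_status_counts(sample)
assert dict(result) == expected, f"unexpected counts: {dict(result)}"
```

The check is the same whether a person or a model wrote the parser; it only catches mistakes that the hand-counted sample happens to exercise, so the sample should include the odd cases (here, a malformed line).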

Another case is prototyping. A few weeks ago I made a prototype to show to the stakeholders, and it was generally way faster than if I wrote it myself.

It’s writing most of my code now. Even if it’s existing code you can feed in the 1-2 files in question and iterate on them. Works quite well as long as you break it down a bit.

It’s not gaslighting: the latest versions of GPT, Claude, and Llama have gotten quite good.

These tools must be absolutely massively better than whatever Microsoft has then because I’ve found that GitHub Copilot provides negative value. I’d be more productive just turning it off rather than auditing its incorrect answers and hoping one day it’s as good as people market it as.
> These tools must be absolutely massively better than whatever Microsoft has then

I haven't used anything from Microsoft (including Copilot) so I'm not sure how it compares, but among the local models I've been able to load and the various remote third-party ones (like Claude), nothing comes close to GPT4 from OpenAI, especially for coding. Maybe give that a try if you can.

It still produces overly verbose code and doesn't really think about structure well (kind of like a junior programmer), but with good prompting you can kind of address that somewhat.

My experience was the opposite.

GPT4 and variants would only respond in vague generalities and had to be endlessly prompted forward.

Claude was the opposite, wrote actual code, answered in detail, zero vagueness, could appropriately re-write and hoist bits of code.

Probably these services are so tuned (not as in "fine-tuned" ML style) to each individual user that it's hard to get any sort of collective sense of what works and what doesn't. Not having any transparency whatsoever into how they tune the model for individual users doesn't help either.
My employer blocks ChatGPT at work and we are forced to use Copilot. It's trash. I use Google docs to communicate with GPT on my personal device. GPT is so much better. Copilot reminds me of GPT3. Plausible, but wrong all the time. GPT 4o and o1 are pretty much bang on most of the time.
Which languages do you use?
My experience is anecdotal, based on a sample size of one. I'm not writing to convince, but to share. Please take a look at my resume to see my background, so you can weigh what I write.

I tried cursor because a technically-minded product manager colleague of mine managed to build a damned solid MVP of an AI chat agent with it. He is not a programmer, but knows enough to kick the can until things work. I figured if it worked for him, I might invest an hour of my time to check it out.

I went in with a time-boxed hour to install cursor and implement a single trivial feature. My app is not very sophisticated - mostly a bunch of setup flows and CRUD. However, there are some non-trivial things which I would expect to have documented in a wiki if I was building this with a team.

Cursor did really well. It generated code that was close to working. It figured out those not-obvious bits as well and the changes it made kept them in mind. This is something I would not expect from a junior dev, had I not explained those cross-dependencies to them (mostly keeping state synchronized according to business rule across different entities).

It did a poor job of applying those changes to my files. It would not add the code it generated in the right places and would mess things up along the way. I felt I was wrestling with it a bit too much for my liking. But once I figured this out I started hand-applying its changes and reviewing them as I incorporated them into my code. This workflow was beautiful.

It was as if I sent a one paragraph description of the change I want, and received a text file with code snippets and instructions where to apply them.

I ended up spending four hours with cursor and giving it more and more sophisticated changes and larger features to implement. This is the first AI tool I tried where I gave it access to my codebase. I picked cursor because I've heard mixed reviews about others, and my time is valuable. It did not disappoint.

I can imagine it will trip up on a larger codebase. These tools are really young still. I don't know about other AI tools, and am planning on giving them a whirl in the near future.

Copilot is terrible. You need to use Cursor or at the very least Continue.dev w/ Claude Sonnet 3.5.

It's a massive gulf of difference.

That sounds almost like the complete opposite of my experience and I'm also working in a big Rails app. I wonder how our experiences can be so diametrically different.
What kind of things are you using it for? I’ve tried asking it things about the app and it only gives me generic answers that could apply to any app. I’ve tried asking it why certain things changed after a rails update and it gives me generic troubleshooting advice that could apply to anything. I’ve tried getting it to generate tests and it makes up names for things or generally gets it wrong.
OP here. I am explicitly NOT blindly trusting the output of the AI. I am treating it as a suspicious set of code written by an inexperienced developer. Doing full code review on it.
I don't think this criticism is valid at all.

What you are saying will occasionally happen, but mistakes already happen today.

Standards for quality, client expectations, competition for market share, all those are not going to go down just because there's a new tool that helps in creating software.

New tools bring with them new ways to make errors, it's always been that way and the world hasn't ended yet...