by omalled 861 days ago
Can you say in one or two sentences what you mean by “lazy at coding” in this context?

Short answer: Rather than fully writing code, GPT-4 Turbo often inserts comments like "... finish implementing function here ...". I made a benchmark based on asking it to refactor code that provokes and quantifies that behavior.

Longer answer:

I found that I could provoke lazy coding by giving GPT-4 Turbo refactoring tasks, where I ask it to refactor a large method out of a large class. I analyzed 9 popular open source Python repos and found 89 such methods that were conceptually easy to refactor, and built them into a benchmark [0].

GPT succeeds on this task if it can remove the method from its original class and add it to the top level of the file with appropriate changes to the size of the abstract syntax tree. By checking that the size of the AST hasn't changed much, we can infer that GPT didn't replace a bunch of code with a comment like "... insert original method here...". The benchmark also gathers other laziness metrics like counting the number of new comments that contain "...". These metrics correlate well with the AST size tests.

[0] https://github.com/paul-gauthier/refactor-benchmark
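
A rough sketch of the two checks described above, using only Python's standard `ast` and `tokenize` modules. The function names, the 10% tolerance, and the way the two signals are combined are assumptions for illustration; the actual benchmark at [0] may do this differently.

```python
import ast
import io
import tokenize


def ast_size(source: str) -> int:
    """Count every node in the parsed abstract syntax tree."""
    return sum(1 for _ in ast.walk(ast.parse(source)))


def ellipsis_comments(source: str) -> int:
    """Count comments that contain '...' (a common sign of elided code)."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return sum(
        1 for tok in tokens
        if tok.type == tokenize.COMMENT and "..." in tok.string
    )


def looks_lazy(original: str, refactored: str, tolerance: float = 0.1) -> bool:
    """Flag a refactor whose AST shrank a lot or that gained '...' comments.

    The tolerance value is a made-up default, not taken from the benchmark.
    """
    shrank = ast_size(refactored) < ast_size(original) * (1 - tolerance)
    new_ellipsis = ellipsis_comments(refactored) > ellipsis_comments(original)
    return shrank or new_ellipsis
```

Moving a method to the top level should leave the node count roughly unchanged, so a large drop suggests that real code was swapped for a placeholder.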

I use gpt4-turbo through the API many times a day for coding. I have encountered this behavior maybe once or twice, period. When it did happen, it always made sense as the model essentially summarizing and/or assuming some shared knowledge (that was indeed known to me).

This, and people generally saying that chatGPT has been intentionally degraded, are just super strange for me. I believe it’s happening but it’s making me question my sanity. What am I doing to get decent outputs? Am I simply not as picky? I treat every conversation as though it needs to be vetted, because it does, regardless of how good the model is. I only trust output from the model on subjects where I am a subject matter expert or that are in a closely adjacent field. Otherwise I treat it much like an internet comment - useful for surfacing curiosities but requires vetting.

IMO it is because there is a huge stochastic element to all this.

If we were all flipping coins there would be people claiming that coins only come up tails. There would be nothing they were doing though to make the coin come up tails. That is just the streak they had.
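
To put a rough number on the coin-flip analogy (the head count and streak length below are made up for illustration): if the flips are fair and independent, the chance that one person sees k tails in a row is 0.5^k, so among n people roughly n * 0.5^k will have seen nothing but tails.

```python
def expected_all_tails(people: int, flips: int) -> float:
    """Expected number of people whose fair-coin flips all come up tails."""
    return people * 0.5 ** flips


# Hypothetical numbers: 1,000,000 people, 10 flips each.
print(expected_all_tails(1_000_000, 10))  # 976.5625
```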

Some days I get lucky with chatGPT4 and some days I don't.

It is also ridiculous how we talk about this as if every subject and context presented to chatGPT4 will produce uniform output. A one-word difference in your own prompt might change things completely, even when you are trying to accomplish exactly the same thing. Now scale that to all the people talking about chatGPT, with everyone using it for something different.

Whenever chatGPT gets lazy with the coding, for example "//make sure to implement search function ....", I feed its own comments and code back as the prompt: "you make sure to implement the search function", and so on. That has been working for me.
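
The trick above can be sketched in a few lines: pull the placeholder comments out of the reply and turn them into a follow-up prompt. The function names, the regular expression, and the prompt wording are all assumptions for illustration. Sending the prompt back to the model is left out.

```python
import re
from typing import List, Optional

# Assumption: a "lazy" placeholder is a whole-line // or # comment containing "...".
# Comments that trail real code on the same line are not matched.
PLACEHOLDER = re.compile(r"^[ \t]*(?://|#)\s*(.*\.\.\..*)$", re.MULTILINE)


def find_placeholders(reply: str) -> List[str]:
    """Return the text of each placeholder comment, without dots or padding."""
    return [m.group(1).strip(" .") for m in PLACEHOLDER.finditer(reply)]


def follow_up_prompt(reply: str) -> Optional[str]:
    """Build a prompt that hands the model's own placeholders back to it."""
    todo = find_placeholders(reply)
    if not todo:
        return None  # nothing was elided, so no follow-up is needed
    items = "\n".join(f"- {item}" for item in todo)
    return (
        "Your previous answer left these parts unimplemented:\n"
        f"{items}\n"
        "Write out the full code for each of them.\n\n"
        f"Previous answer:\n{reply}"
    )
```
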
> I use gpt4-turbo through the api many times a day for coding.

Why this instead of GPT-4 through the web app? And how do you actually use it for coding? Do you copy and paste your question into a Python script, which then calls the OpenAI API and spits out the response?

Not the OP, but I also use it through the API (specifically MacGPT). My initial justification was that I would save by only paying for what I use, instead of a flat $20/mo, but now it looks like I’m not even saving much.
I use it fairly similarly via a Discord bot I've written. This lets me share usage w/ some friends (although it has some limitations compared to the OpenAI chatGPT app).
I have a bunch of code I need to refactor, and also write tests for. (I guess I should make the tests before the refactor). How do you do a refactor with GPT-4? Do you just dump the file into the chat window? I also pay for GitHub Copilot, but not GPT-4. Can I use Copilot for this?

Any advice appreciated!

> Do you just dump the file into the chat window?

Yes, along with what you want it to do.

> I also pay for GitHub Copilot, but not GPT-4. Can I use Copilot for this?

Not that I know of. Copilot is good at generating new code but can't change existing code.

GitHub Copilot Chat (which is part of Copilot) can change existing code. The UI is that you select some code, then tell it what you want. It returns a diff that you can accept or reject. https://docs.github.com/en/copilot/github-copilot-chat/about...
Copilot will change existing code (though I find it's often not very good at it). I frequently highlight a section of code that has an issue, press Ctrl-I and type something like "/fix SomeError: You did it wrong"
It has a tendency to do:

"// ... the rest of your code goes here"

in its responses, rather than writing it all out.

It's incredibly lazy. I've tried to coax it into returning the full code, and it will claim to follow the instructions while regurgitating the same output you complained about. GPT-4 was great. The first version of GPT-4 Turbo was pretty terrible, bordering on unusable. Then they came out with the second Turbo version, which almost feels worse to me, though I haven't compared them directly. And if someone comes claiming they fixed an issue but you still see it, that will bias you to see it more.

Claude is doing much better in this area, and local/open LLMs are getting quite good. It feels like OpenAI is not heading in a good direction here, and I hope they course correct.

I have a feeling full-powered LLMs are reserved for the more equal animals.

I hope some people remember and document details of this era, future generations may be so impressed with future reality that they may not even think to question its fidelity, if that concept even exists in the future.

> I hope some people remember and document details of this era, future generations may be so impressed with future reality that they may not even think to question its fidelity, if that concept even exists in the future.

The former sounds like a great training set to enable the latter. :(

…could you clarify? Is this about “LLMs can be biased, thus making fake news a bigger problem”?
I suspect it's sort of like "you can have a fully uncensored LLM iff you have the funds"
Imagine if the first version of ChatGPT we all saw was fully sanitised..

We know it knows how to make gunpowder (for example), but only because it would initially tell us.

Now it won't without a lot of trickery. Would we even be pushing to try and trick it into doing so if we didn't know it actually could?

> Would we even be pushing to try and trick it into doing so if we didn't know it actually could?

Would somebody try to push a technical system to do things it wasn't necessarily designed to be capable of? Uh... yes. You're asking this question on _Hacker_ News?

Ah, so it’s more about “forbidden knowledge” than “fake news”. Makes sense. I don’t personally see that as toooo much of an issue, since other sources still exist, e.g. Wikipedia, the Internet Archive, libraries, or that one Minecraft Library of Alexandria project. So I see knowledge storage staying there and LLMs staying put in the interpretation/transformation role, for the foreseeable future.

But obviously all that social infrastructure is fragile… so you’re not wrong to be alarmed, IMO

I confidently predict that we sheep will not have access to the same power our shepherds will have.
People need to be using their local machines for this. Because otherwise the result is going to be a cloud service provider having literally everyone's business logic somewhere in their system and that goes wrong real quick.
It’s so interesting to see this discussion. I think this is a matter of “more experienced coders like and expect and reward that kind of output, while less experienced ones want very explicit responses”. So there’s this huge LLM Laziness epidemic that half the users can't even see.
I'm paying for ChatGPT GPT4 to complete extremely tedious, repetitive coding tasks. The newly occurring laziness directly and negatively impacts my day-to-day use, to the point where I'm now willing to try alternatives. I still think I get value - indeed I'd probably pay $1,000/mo instead of $20/mo - but I'm only going to pay for one service.
I mean, isn't that better as long as it actually writes the part that was asked? Who wants to wait for it to sluggishly generate the entire script for the 5th time and then copy the entire thing yet again.
it was really good at some point last fall, solving problems that it had previously completely failed at, albeit after a lot of iterations via autogpt. at least for the tests i was giving it which usually involved heavy stats and complicated algorithms, i was surprised it passed. despite it passing the code was slower than what i had personally solved the problem with, but i was completely impressed because i asked hard problems.

nowadays the autogpt gives up sooner, seems less competent, and doesn't even come close to solving the same problems

Hamstringing high value tasks (complete code) to give forthcoming premium offerings greater differentiation could be a strategy. On the other hand, doing so would open the door for competitors.
The question I have been wondering about is whether they are hamstringing high value tasks to create room for premium offerings or trying to minimize cost per task.
I think it's the latter. Reading between the lines on costs gives me the impression they have strived to lower computational costs. They already added a cap of 30 queries per 3 hours...
this is exactly what I noticed too