by lunar_mycroft 17 days ago
I've seen the code they produce without extensive help from human developers, this is clearly false.

Good to see the classic "yeah the models weren't good enough six months ago, but this time they actually are, promise! Please forget you were hearing the exact same thing six months ago!" is alive and well though.


Are you aware of performance trends though? You’re painting a picture that seems to ignore how things have consistently trended for many years now, even pre ChatGPT. It is absolutely data driven to say “an inflection point has happened within the last 6 months”. And that was also true 6 months ago (when people started using coding agents fairly consistently, since Sonnet 4). And it was true 6 months before that. It’s not like people said “we’ve fixed all the bugs!” and then nothing changed. I don’t necessarily agree with the parent poster that agents are better than humans, but they are certainly much better at many tasks.
> Are you aware of performance trends though? You’re painting a picture that seems to ignore how things have consistently trended for many years now, even pre ChatGPT.

Models have been getting better, but all that follows from that is that newer models tend to be better than older ones. It doesn't follow that they have (or even will in the future) gotten better than anything else, be that human developers, a given definition of good enough, etc.

> It is absolutely data driven to say “an inflection point has happened within the last 6 months”.

With all due respect to OP (who I think is responsible for popularizing that way of phrasing it), I don't think it is when you consider the actual definition of "inflection point". At best I think you can say that models crossed a lot of developers' definition of good enough around then, which is a different thing. The problem I have with that is that, as a (mostly) outsider looking in, it doesn't seem like they're right.

> Models have been getting better, but all that follows from that is that newer models tend to be better than older ones. It doesn't follow that they have (or even will in the future) gotten better than anything else, be that human developers, a given definition of good enough, etc.

But this is not true: you’re saying we only have relative performance numbers and not absolute measures of capabilities and reliability but that’s simply not true. OSS benchmarks as well as the internal flywheels of these companies are good complementary measurements.

> At best I think you can say that models crossed a lot of developers' definition of good enough around then, which is a different thing

That’s the inflection point. Implication is a massive jump in adoption. We’re not like pulling this out of a hat, there are a number of compelling datapoints. The onus is on people to bring actual evidence that contradicts all of the data and observations we have.

> you’re saying we only have relative performance numbers and not absolute measures of capabilities and reliability but that’s simply not true.

No, I'm saying that the claim you were making ("current models are better than some non-model based standard X") does not follow from your premise ("current models are better than past models"). It's possible that your claim is still true (although I don't think it is for most of the values of X that matter), but that wouldn't change the fact that the argument made is invalid.

As stated, your argument was basically the classic "my 3-month-old is now twice the size he was when he was born" meme, except if the tweet claimed that the kid currently outweighed an elephant.

> That’s the inflection point.

No, it isn't. An inflection point is when the direction of curvature changes. If we crossed over into the diminishing returns part of the logistic function, that would be an inflection point (as would the case where we had been in the diminishing returns regime, but then progress went back to speeding up).
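To make that definition concrete, here is a minimal Python sketch. It assumes a logistic curve with arbitrary unit parameters, and the helper names are hypothetical, chosen only for this illustration. It estimates the second derivative with finite differences and reports where its sign flips.

```python
import math

def logistic(t, cap=1.0, rate=1.0, midpoint=0.0):
    """Logistic curve cap / (1 + exp(-rate * (t - midpoint))). Parameters are illustrative."""
    return cap / (1.0 + math.exp(-rate * (t - midpoint)))

def second_difference(f, t, h=1e-3):
    """Central finite-difference estimate of the second derivative f''(t)."""
    return (f(t + h) - 2.0 * f(t) + f(t - h)) / (h * h)

def curvature_sign_changes(f, ts):
    """Return the sample points at which the sign of f'' differs from the previous sample."""
    changes = []
    prev = second_difference(f, ts[0])
    for t in ts[1:]:
        cur = second_difference(f, t)
        if prev * cur < 0:
            changes.append(t)
        prev = cur
    return changes

# Sample grid from -5.25 to 5.25 in steps of 0.5 (it deliberately skips t = 0).
ts = [x / 2 - 5.25 for x in range(22)]

# Logistic: curvature flips once, between the samples on either side of the midpoint.
print(curvature_sign_changes(logistic, ts))

# Pure exponential growth: always speeding up, so the curvature never changes sign.
print(curvature_sign_changes(math.exp, ts))
```

On this grid the logistic curve shows a single sign change around its midpoint, while the exponential shows none: a curve can keep rising, and even keep accelerating, without ever passing an inflection point.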

> Implication is a massive jump in adoption.

The point I made was that "a massive jump in adoption" doesn't actually imply "the models are actually good enough now", only that a lot more people think they are.

OK, I am having the wrong conversation; on that you are right. The parent OP is saying:

- best model is still a human: this I SORT of agree with, but like I say it's uneven

- response is: "this was true 6 months ago but is now false" -- that is sort of a mixed bag; if it's saying we can now replace SWEs that's demonstrably wrong, if it's saying it now clearly has superior abilities in many parts of the SWE workflow then it's demonstrably right. I would argue this has been true for longer than 6 months.

- you say: "I've seen the code they produce without extensive help from human developers, this is clearly false." -- I agree with you that you need to help coding agents substantially, but I think at this point the convo is unclear what anyone is actually addressing or responding to

> No, I'm saying that the claim you were making ("current models are better than some non-model based standard X") does not follow from your premise ("current models are better than past models").

That isn't my premise though, but I admit I misread you. I think you are saying "people keep saying it's good enough to replace SWEs(?)" every six months and they are wrong every time. I don't disagree that we have not gotten to a "we don't need SWEs anymore" point, but I think it's a bit of a strawman: who is making the claim you are addressing?

> As stated, your argument was basically the classic "my 3-month-old is now twice the size he was when he was born" meme, except if the tweet claimed that the kid currently outweighed an elephant.

No no, I'm talking about performance in absolute terms. These are strong proxies (SWE-bench, etc) though they have serious limitations. The "yeah, but when is the failure rate low enough for us to replace an entire tranche of processes" question is harder to answer, but the strong proxy for that is adoption.

> The point I made was that "a massive jump in adoption" doesn't actually imply "the models are actually good enough now", only that a lot more people think they are.

No but then the point I'm making is we're drifting further and further away from Occam's razor.

> No, it isn't. An inflection point is when the direction of curvature changes. If we crossed over into the diminishing returns part of the logistic function, that would be an inflection point (as would the case where we had been in the diminishing returns regime, but then progress went back to speeding up).

I admit inflection point may be the wrong term here, but I hope you know at least what I'm trying to say; maybe like regime change or something. But plenty of data supports a major change around Nov to ~Jan: revenue, weekly active users, business subscriptions, GitHub commit estimates. You pick which is your favorite data source, but they are all complementary and all point to the same thing.

First, to clarify my own position here: I use LLMs for code review, to help with some planning, for the occasional throwaway prototype, and as a more advanced rubber duck, but I do not let LLMs write code I care about, even with human review (because human review is imperfect).

> it's uneven... if it's saying it now clearly has superior abilities in many parts of the SWE workflow then it's demonstrably right.

In what ways are current LLMs better excluding speed and cost (because getting those things by relaxing constraints on quality has always been trivially possible)? Even the fabled (heh) Mythos seems to be at best roughly equivalent to a competent human security researcher.

> I think you are saying "people keep saying it's good enough to replace SWEs(?)" every six months and they are wrong every time. I don't disagree that we have not gotten to a "we don't need SWEs anymore" point, but I think it's a bit of a strawman: who is making the claim you are addressing?

Most of them aren't saying that the models are good enough to fully replace developers, but this definitely isn't a strawman. I've been seeing the same basic claim for at least 18 months at this point.

> No no, I'm talking about performance in absolute terms. These are strong proxies (SWE-bench, etc)

Unless you have some non-LLM scores to compare to, those are still relative measures. They show/suggest that LLMs are getting better (at least in some ways), but without a definition of "good enough" in the same metric, that isn't sufficient to say whether or not they are.

> No but then the point I'm making is we're drifting further and further away from Occam's razor.

Both sides of the debate have to explain the fact that a lot of developers disagree with them, so I don't think this argument really works.