by aspenmartin 14 days ago
OK, I am having the wrong conversation, you are right about that -- parent OP is saying:

- best model is still a human: this I SORT of agree with, but like I say it's uneven

- response is: "this was true 6 months ago but is now false" -- that is sort of a mixed bag; if it's saying we can now replace SWEs that's demonstrably wrong, if it's saying it now clearly has superior abilities in many parts of the SWE workflow then it's demonstrably right. I would argue this has been true for longer than 6 months.

- you say: "I've seen the code they produce without extensive help from human developers, this is clearly false." -- I agree with you that you need to help coding agents substantially, but I think at this point it's unclear what anyone in the convo is actually addressing or responding to

> No, I'm saying that the claim you were making ("current models are better than some non-model based standard X") does not follow from your premise ("current models are better than past models").

That isn't my premise, though I admit I misread you. I think you are saying "people keep saying it's good enough to replace SWEs(?)" every six months and they are wrong every time. I don't disagree that we have not gotten to a "we don't need SWEs anymore" point, but I think it's a bit of a strawman: who is making the claim you are addressing?

> As stated, your argument was basically the classic "my 3-month-old is now twice the size he was when he was born" meme, except if the tweet claimed that the kid currently outweighed an elephant.

No no, I'm talking about performance in absolute terms. These are strong proxies (SWE-bench, etc) though they have serious limitations. The "yeah, but when is the failure rate low enough for us to replace an entire tranche of processes" question is harder to answer, but the strong proxy for that is adoption.

> The point I made was that "a massive jump in adoption" doesn't actually imply "the models are actually good enough now", only that a lot more people think they are.

No but then the point I'm making is we're drifting further and further away from Occam's razor.

> No, it isn't. An inflection point is when the direction of curvature changes. If we crossed over into the diminishing returns part of the logistic function, that would be an inflection point (as would the case where we had been in the diminishing returns regime, but then progress went back to speeding up).

I admit inflection point may be the wrong term here, but I hope you at least know what I'm trying to say; maybe regime change or something like that. But plenty of data supports a major change around Nov to ~Jan: revenue, weekly active users, business subscriptions, GitHub commit estimates. Pick whichever data source is your favorite; they are all complementary and all point to the same thing.


First, to clarify my own position here: I use LLMs for code review, to help with some planning, for the occasional throwaway prototype, and as a more advanced rubber duck, but I do not let LLMs write code I care about, even with human review (because human review is imperfect).

> it's uneven... if it's saying it now clearly has superior abilities in many parts of the SWE workflow then it's demonstrably right.

In what ways are current LLMs better excluding speed and cost (because getting those things by relaxing constraints on quality has always been trivially possible)? Even the fabled (heh) Mythos seems to be at best roughly equivalent to a competent human security researcher.

> I think you are saying "people keep saying it's good enough to replace SWEs(?)" every six months and they are wrong every time. I don't disagree that we have not gotten to a "we don't need SWEs anymore" point, but I think it's a bit of a strawman: who is making the claim you are addressing?

Most of them aren't saying that the models are good enough to fully replace developers, but this definitely isn't a strawman. I've been seeing the same basic claim for at least 18 months at this point.

> No no, I'm talking about performance in absolute terms. These are strong proxies (SWE-bench, etc)

Unless you have some non-LLM scores to compare to, those are still relative measures. They show/suggest that LLMs are getting better (at least in some ways), but without a definition of "good enough" in the same metric, that isn't sufficient to say whether or not they are.

> No but then the point I'm making is we're drifting further and further away from Occam's razor.

Both sides of the debate have to explain the fact that a lot of developers disagree with them, so I don't think this argument really works.

> I do not let LLMs write code I care about, even with human review (because human review is imperfect).

That's fine but you are in the quickly vanishing minority.

> In what ways are current LLMs better excluding speed and cost (because getting those things by relaxing constraints on quality has always been trivially possible)? Even the fabled (heh) Mythos seems to be at best roughly equivalent to a competent human security researcher.

Well, this is what I mean by benchmarks and measurement efforts. There are lots of gaps in capabilities, but we've had, say, superhuman competitive programming performance for a while (including on fresh tasks not in training sets); extremely strong performance (super-p90-engineer) on, say, language-to-language porting; RE-bench (ML research engineering benchmark from METR) is already clearly above human perf; Mythos clearly (unless you believe this is all a massive fraud) has superior cyber capabilities; etc. Also, why do you discount speed and cost so much?

> Most of them aren't saying that the models are good enough to fully replace developers, but this definitely isn't a strawman. I've been seeing the same basic claim for at least 18 months at this point.

Yeah, but what's the basic claim you're referring to here? Every model iteration is a significant bump up in performance according to a lot of complementary and principled measurements. What's been the thing that hasn't been true?

> Unless you have some non-LLM scores to compare to, those are still relative measures. They show/suggest that LLMs are getting better (at least in some ways), but without a definition of "good enough" in the same metric, that isn't sufficient to say whether or not they are.

Number one, there are human baselines in plenty of these benchmarks. Number two, no one is going to be able to tell you "once SWE-Bench Pro perf numbers get to X we can then refactor our existing process to completely offload task Y to agentic frameworks"; that's a bit of a crazy ask. These numbers are pretty interpretable and many are pretty robust to things like training set leakage. What would you want to see here?

> Both sides of the debate have to explain the fact that a lot of developers disagree with them, so I don't think this argument really works.

Yet one side has a mountain of hard evidence and the other side has...an outdated n < 20 METR study using Sonnet 3?

> we've had, say, superhuman competitive programming performance for a while

Fair. Question though, is this when compared to competitive programmers, or developers in general?

> extremely strong performance (super-p90-engineer) on say language-to-language porting

I'd need to see the methodology here and could easily be wrong, but I suspect this is largely down to "faster" and "willing to do a lot more of it without complaining".

> RE-bench (ML research engineering benchmark from METR) is already clearly above human perf

This pretty much has to be "relative to devs who don't specialize in that area", because if it weren't, the frontier labs wouldn't be paying a fortune to hire ML researchers.

> Mythos clearly has superior cyber capabilities

Based on Daniel Stenberg's experience with it [0], it seems like it's at best roughly on par with human experts. Its advantage is cost/speed.

> Also, why do you discount speed and cost so much?

Because in all the domains LLMs are applicable to, getting something cheaper/faster at the expense of quality isn't new or particularly interesting.

> Every model iteration is a significant bump up in performance according to a lot of complementary and principled measurements. What's been the thing that hasn't been true?

That they were good enough. To reuse the baby analogy, if every week your friend told you that their infant child was now heavier than an elephant (while acknowledging that the baby was lighter than one the previous week), and every week that turned out not to be true, it wouldn't be a defense of your friend to argue "ah, but the baby was heavier every week than the week before".

Also worth noting that as of ~8 months ago, while benchmark scores were steadily increasing, merge rates (aka whether the code was "good enough") were not [1].

> that's a bit of a crazy ask.

Why? If you use LLMs to do anything you're basically doing that already, it's just that the scope of your Y is smaller. Either the benchmarks are irrelevant and you're using something else to determine when that's appropriate for a given Y, or you do in fact have a value of X for the Y's you've handed over to LLMs.

> Yet one side has a mountain of hard evidence and the other side has...an outdated n < 20 METR study using Sonnet 3?

There's a lot of irony here, because by far the most common pro-LLM coding argument is "I feel like I'm producing good code faster with them", followed by "this other person feels like they're producing good code faster with them".

Also note that the most important part of the METR study you reference wasn't the slowdown they observed, it was the dramatic disagreement between what the participants thought the impact of AI was vs what it actually was. That isn't dependent on the model.

[0] https://daniel.haxx.se/blog/2026/05/11/mythos-finds-a-curl-v...

[1] https://entropicthoughts.com/no-swe-bench-improvement