> we've had say superhuman competitive programming performance for awhile

Fair. Question though, is this when compared to competitive programmers, or developers in general?

> extremely strong performance (super-p90-engineer) on say language-to-language porting

I'd need to see the methodology here and could easily be wrong, but I suspect this is largely down to "faster" and "willing to do a lot more of it without complaining".

> RE-bench (ML research engineering benchmark from METR) is already clearly above human perf

This pretty much has to be "relative to devs who don't specialize in that area", because if it weren't, the frontier labs wouldn't be paying a fortune to hire ML researchers.

> Mythos clearly has superior cyber capabilities

Based on Daniel Stenberg's experience with it [0], it seems like it's at best roughly on par with human experts. Its advantage is cost/speed.

> Also, why do you discount speed and cost so much?

Because in all the domains LLMs are applicable to, getting something cheaper/faster at the expense of quality isn't new or particularly interesting.

> Every model iteration is a significant bump up in performance according to a lot of complementary and principled measurements. What's been the thing that hasn't been true?

That they were good enough. To reuse the baby analogy, if every week your friend told you that their infant child was now heavier than an elephant (while acknowledging that the baby was lighter than one the previous week), and every week that turned out not to be true, it wouldn't be a defense of your friend to argue "ah, but the baby was heavier every week than the week before".

Also worth noting that as of ~8 months ago, while benchmark scores were steadily increasing, merge rates (aka whether the code was "good enough") were not [1].

> thats a bit of a crazy ask.

Why? If you use LLMs to do anything you're basically doing that already, it's just that the scope of your Y is smaller. Either the benchmarks are irrelevant and you're using something else to determine when that's appropriate for a given Y, or you do in fact have a value of X for the Y's you've handed over to LLMs.

> Yet one side has a mountain of hard evidence and the other side has...an outdated n < 20 METR study using Sonnet 3?

There's a lot of irony here, because by far the most common pro-LLM coding argument is "I feel like I'm producing good code faster with them", followed by "this other person feels like they're producing good code faster with them".

Also note that the most important part of the METR study you reference wasn't the slowdown they observed, it was the dramatic disagreement between what the participants thought the impact of AI was vs what it actually was. That isn't dependent on the model.

[0] https://daniel.haxx.se/blog/2026/05/11/mythos-finds-a-curl-v...

[1] https://entropicthoughts.com/no-swe-bench-improvement