HN Mirror


by bnchrch 4 days ago

An 11% jump over opus 4.8 and a 22% jump over gpt 5.5 on Agentic Coding Benchmarks is certainly impressive.

Obviously still need to verify it for myself to see if it's truly a leap.

But am I the only one wondering, "What can I do today that I couldn't do yesterday?"

Previously I would think "Oh I wonder if I can finally get it to do X now?"

However, now I feel like yesterday's models were more than capable of handling nearly any engineering task I paired with them on.

Maybe this is the final leap where I can comfortably set up an autonomous coding loop? Maybe.

1 comment

AlexSonn 4 days ago

Agree the per-task capability hasn't been the blocker for a while. But on the autonomous-loop question — in my experience that's not gated by how good the model is on any single step. What kills the loop is it slowly losing the constraints from earlier in the run and walking back decisions you'd already settled.
