Hacker News
by nwienert 72 days ago
4.5 is better than 4.6 in practice, though. 4.6 was purely a cost-savings change with enough benchmark gaming to look better.
2 comments

I've found Opus 4.6 to be smarter than 4.5, at least in some ways. There's a bug I'd been trying to solve for a decade (as had other people), and I've given it to each new model to try, including in interactive sessions. Each model got closer, but none actually solved it until Opus 4.6 got it on the first go (I probably used Ultrathink). This was before the 1M context was available.

I'd agree that 4.6 and 4.5 are different, but I don't think it's correct that 4.6 is just reduced and benchmaxxed. It genuinely solved problems for me that no other model has been able to.

I think I'd like to have seen the 4.6 benchmarks also included against Qwen.

Exactly. 3.6 plus in the exact same coding agent harness is notably worse in all of my testing compared to 3.5 plus.

3.6 plus gets stuck in ridiculous thought loops on the exact same tasks I’m testing. Fascinating, really; I expected more for some reason.