Hacker News
by Aurornis 31 days ago
> DS stopped every 30 minutes or so, saying it did full RE and it should all work now, while in fact, it didn't complete even 1% of it. It also looked for shortcuts again and again, despite me prompting heavily that the specific shortcut may not be used. It was a complete and utter failure.

This is my experience with non-SOTA models across the board. When you try them on little tasks and they work, it feels amazing, but then you go deeper and you're back to going in loops and fighting the model for hours.

Switching back to a SOTA model immediately yields progress again.

When I read all of the comments from people saying they can't tell a difference between Opus and <insert open weight model here> I don't know if they haven't really used it much yet, or if they're just not doing anything complicated.

1 comments

Did you read the OP, where he's chiding exactly the model you're glazing?
Did you intentionally miss the point of my comment? Swap Opus for GPT-5.5 if you will. I use both, as well as locally hosted models, even using some of your branches.
Fair enough. I agree with you, although DS4 Pro is a GPT-5-class model that scores 46% on ARC-AGI-2[^1]. It's behind by maybe 9 months, but I think it's still good enough for a lot of complex tasks. They definitely need to work on a "just fucking works" harness like CC/Codex. Also, thanks!

[^1] https://www.nist.gov/news-events/news/2026/05/caisi-evaluati...