by sjmog 357 days ago
Yesterday I posted a video testing claude-4 sonnet solving an https://simstack.io long-horizon SWE challenge unaided (https://news.ycombinator.com/item?id=44424468). For comparison, here's gemini 2.5-pro.

I noticed that 2.5-pro is way more cavalier than claude: it skips backups and tries "more stuff more quickly", where claude takes a more cautious approach.