by ivywho 133 days ago
Interesting that every agent except this one basically falls off a cliff on hard tasks. Operator going from 83% to 43% is wild - that means it's failing more often than not on anything non-trivial.

The failure traces being public is a nice touch. Looked through a few and they're actual failures, not cherry-picked easy ones. Most companies in this space wouldn't do that.

Curious about latency though: what does a typical hard-task execution look like in terms of wall-clock time?