by ivywho
133 days ago
Interesting that every agent basically falls off a cliff on hard tasks except this one. Operator going from 83% to 43% is wild - that means it's basically coin-flipping on anything non-trivial. The failure traces being public is a nice touch. I looked through a few and they're actual failures, not cherry-picked easy ones. Most companies in this space wouldn't do that. Curious about latency though: what does a typical hard task execution look like in terms of wall-clock time?