Hacker News
by
flakiness
215 days ago
To be honest I'm surprised how well it holds. I expected close-to-total collapse. It'll be a matter of time I guess, but still.
3 comments
criddell
215 days ago
I wonder if any of the agents hit the audio button and listened to the instructions? In my experience, that can be pretty helpful.
mdahardy
215 days ago
Same! As we discuss in the article, the failures came less from raw model intelligence or ability than from challenges with timing and dynamic interfaces.
swyx
215 days ago
i mean, did you see the cross-tile numbers?