Hacker News
by flakiness 215 days ago
To be honest, I'm surprised how well it holds up. I expected close-to-total collapse. It's a matter of time, I guess, but still.
3 comments

I wonder if any of the agents hit the audio button and listened to the instructions? In my experience, that can be pretty helpful.
Same! As we talk about in the article, the failures came less from raw model intelligence/ability than from challenges with timing and dynamic interfaces.
I mean, did you see the cross-tile numbers?