by codebyron 133 days ago
The 15-point drop from easy to hard is the number that stands out to me.

A drop that small suggests the architecture handles state accumulation across steps without compounding errors, which is the thing that kills most agent pipelines. Every other agent here shows exponential degradation as task length increases, which is what you'd expect from a naive screenshot-action loop with no error recovery.
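
To put rough numbers on the compounding point, here is a back-of-the-envelope sketch. None of it comes from the benchmark: the 95% per-step success rate, the assumption that steps fail independently, the assumption that a failed step is always detected before retrying, and both function names are mine, purely for illustration.

```python
def success_rate(per_step: float, steps: int) -> float:
    """Chance an n-step task completes when any single failed step is fatal.

    Assumes every step succeeds independently with the same probability.
    """
    return per_step ** steps


def success_rate_with_retry(per_step: float, steps: int, retries: int) -> float:
    """Same model, but each step can be retried up to `retries` times.

    Assumes a failed step is always detected and that retries are independent.
    """
    # A step now fails only if the first attempt and every retry all fail.
    effective = 1 - (1 - per_step) ** (retries + 1)
    return effective ** steps


if __name__ == "__main__":
    for n in (10, 40):
        print(n, round(success_rate(0.95, n), 3),
              round(success_rate_with_retry(0.95, n, 1), 3))
```

Under those assumptions a 10-step task finishes roughly 60% of the time and a 40-step task roughly 13% of the time with no recovery, while a single detected retry per step keeps the 40-step case around 90%. Real agents won't have independent steps or perfect failure detection, so treat this as the shape of the curve rather than a prediction.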

Looking at the cookbook repo: are you doing any kind of structured DOM extraction before passing the page to the model, or is this pure vision? Curious whether the hard-task performance comes from better perception, better planning, or better recovery when an action doesn't produce the expected state change.