by codebyron
133 days ago
The 15-point drop from easy to hard is the number that stands out to me. A gap that small suggests the architecture handles state accumulation across steps without compounding errors, which is the thing that kills most agent pipelines. Every other agent here shows exponential degradation as task length increases, which is what you'd expect from a naive screenshot-action loop with no error recovery. Looking at the cookbook repo: are you doing any kind of structured DOM extraction before passing to the model, or is this pure vision? I'm curious whether the hard-task performance comes from better perception, better planning, or better recovery when an action doesn't produce the expected state change.
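
To make the compounding point concrete, here's a toy model. It's my own back-of-envelope sketch, not anything taken from the benchmark or the cookbook repo. It assumes every step succeeds independently with the same probability, and `p_recover` is a hypothetical knob for how often a failed action gets noticed and repaired before the next step.

```python
def chain_success(p_step: float, n_steps: int, p_recover: float = 0.0) -> float:
    """Probability that an n-step task completes, assuming independent steps.

    p_step:    chance a single action produces the expected state change.
    p_recover: chance a failed action is detected and repaired (hypothetical).
    """
    # A step counts as good if it works outright, or fails and is recovered.
    effective = p_step + (1.0 - p_step) * p_recover
    # With no memory between steps, per-step odds multiply across the task.
    return effective ** n_steps


if __name__ == "__main__":
    # Illustrative numbers only; none of these come from the reported results.
    for n in (5, 20, 50):
        naive = chain_success(0.95, n)
        with_recovery = chain_success(0.95, n, p_recover=0.8)
        print(f"{n:>2} steps: naive={naive:.3f}  with recovery={with_recovery:.3f}")
```

With no recovery the success rate falls off as `p_step ** n_steps`, which is the exponential curve I mean above. Even partial recovery flattens that curve a lot, which is why I'm asking where the hard-task gains come from.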