PA Bench: Evaluating Frontier Models on Multi-Tab PA Tasks (vibrantlabs.com)
7 points by shahules 124 days ago
1 comment

Most current web agent benchmarks focus on single-tab tasks (e.g., 'go to Gmail and star this email'). We found that frontier models that score highly on those tasks (as in WebArena) often fall apart when they have to coordinate context across two or more applications. We built a simulated environment with scenarios and deterministic verifiers to see why.
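
To make "deterministic verifier" concrete, here is a rough sketch of what one might look like for a two-app scenario. This is not PA Bench's actual code: `EmailApp`, `CalendarApp`, `verify_cross_app_task`, and the ids and titles used are all hypothetical names invented for illustration. The idea is that the verifier inspects the final state of each simulated app and returns the same verdict for the same state, with no model-based judging involved.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class EmailApp:
    # Hypothetical simulated inbox: message id -> starred flag.
    starred: dict[str, bool] = field(default_factory=dict)


@dataclass
class CalendarApp:
    # Hypothetical simulated calendar: titles of the events created so far.
    events: list[str] = field(default_factory=list)


def verify_cross_app_task(email: EmailApp, calendar: CalendarApp) -> bool:
    """Pass only if BOTH apps ended in the expected final state.

    The expected values ("msg-1", "Project sync") are made up for this sketch.
    """
    email_ok = email.starred.get("msg-1", False)
    calendar_ok = "Project sync" in calendar.events
    return email_ok and calendar_ok
```

An agent that stars the email but never creates the calendar event fails this check, which is the kind of cross-application miss a single-tab benchmark would not surface.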