PA Bench: Evaluating Frontier Models on Multi-Tab PA Tasks

Most current web agent benchmarks focus on single-tab tasks (e.g., 'go to Gmail and star this email'). We found that frontier models that score highly on those benchmarks (WebArena, for example) often fall apart when they have to coordinate context across two or more applications. To see why, we built a simulated environment with scenarios and deterministic verifiers.
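
To make "deterministic verifier" concrete, here is a rough sketch of what one could look like for a task that spans an email app and a calendar app. It is an illustration only, not the benchmark's actual code: `WorldState`, `Email`, `CalendarEvent`, and `verify_meeting_scheduled` are hypothetical names, and the assumption is that the simulator exposes a plain snapshot of each app's final state.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Email:
    sender: str
    subject: str
    body: str


@dataclass
class CalendarEvent:
    title: str
    date: str  # ISO date string, e.g. "2025-03-14"
    attendees: frozenset[str]


@dataclass
class WorldState:
    """Hypothetical snapshot of the simulated apps after the agent stops."""
    inbox: list[Email] = field(default_factory=list)
    calendar: list[CalendarEvent] = field(default_factory=list)


def verify_meeting_scheduled(
    state: WorldState, title: str, date: str, attendee: str
) -> bool:
    """Pass only if exactly one event matches the title, date, and attendee.

    The check reads final state only, so the same state always yields the
    same verdict, regardless of which clicks the agent took to get there.
    """
    matches = [
        event
        for event in state.calendar
        if event.title == title
        and event.date == date
        and attendee in event.attendees
    ]
    return len(matches) == 1


# Example: the agent read a date from an email, then created the event.
state = WorldState(
    inbox=[Email("dana@example.com", "Sync?", "Can we meet on 2025-03-14?")],
    calendar=[
        CalendarEvent("Sync with Dana", "2025-03-14", frozenset({"dana@example.com"}))
    ],
)
passed = verify_meeting_scheduled(
    state, "Sync with Dana", "2025-03-14", "dana@example.com"
)
```

Checking final state rather than the action trace is one possible design choice: it keeps the verdict reproducible and does not penalize an agent for taking a different but valid path across tabs.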