Hacker News
by salmacodes 133 days ago
Been trying to get Operator to handle a multi-step workflow for a client (login → navigate nested menus → fill form → confirm) and it just... breaks in the middle every time.

Seeing the hard-task numbers here, that makes a lot more sense.

Honestly, the more interesting thing to me is the benchmark critique. WebVoyager being the default eval while only agreeing with humans 62% of the time is kind of damning for the whole space. Has anyone else tried running their agent against Online-Mind2Web?