by salmacodes
133 days ago
Been trying to get Operator to handle a multi-step workflow for a client (login → navigate nested menus → fill form → confirm) and it just... breaks partway through every time. Seeing the hard-task numbers here, that makes a lot more sense. Honestly, the more interesting thing to me is the benchmark critique. WebVoyager being the default eval while only agreeing with humans 62% of the time is kind of damning for the whole space. Has anyone else tried running their agent against Online-Mind2Web?