by kathyyyyyyyliu
133 days ago
Promising numbers, especially if Online-Mind2Web better reflects real multi-step workflows than WebVoyager. Would love to see a quick breakdown of failure modes and variance by difficulty -- 80%+ on truly stateful web tasks is a strong claim. Either way, more realistic evals are a big win for the space.