Hacker News
by kathyyyyyyyliu 133 days ago
Promising numbers, especially if Online-Mind2Web better reflects real multi-step workflows than WebVoyager. Would love to see a quick breakdown of failure modes and variance by difficulty -- 80%+ on truly stateful web tasks is a strong claim. Either way, more realistic evals are a big win for the space.