HN Mirror

	by kathyyyyyyyliu 133 days ago
	Promising numbers, especially if Online-Mind2Web better reflects real multi-step workflows than WebVoyager. Would love to see a quick breakdown of failure modes and variance by difficulty -- 80%+ on truly stateful web tasks is a strong claim. Either way, more realistic evals are a big win for the space.