	by euphetar 85 days ago
	I wouldn't call it a benchmark since it's just one sample. They do highlight a real problem, though. Computer use is immature right now and far behind language agents. Try playing Fruit Ninja via text and LLM tool calls, though.