by euphetar, 38 days ago
I wouldn't call it a benchmark since it's just one sample. They do highlight a real problem, though. Computer use is immature right now and far behind language agents.
Try playing Fruit Ninja via text and LLM tool calls, though.