| HN Mirror



	by codexon 132 days ago
	This paper creates a new benchmark composed of real remote work tasks sourced from the remote-work website Upwork. The best commercial LLMs, like Opus, GPT, Gemini, and Grok, were tested. Opus 4.6 and GPT 5.3, released a few days ago, haven't been tested yet, but given their performance on other micro-benchmarks, they will probably not perform much differently on this one.

1 comments

kolinko 132 days ago

They didn't test Opus at all, only Sonnet.

One of the tasks was "Build an interactive dashboard for exploring data from the World Happiness Report." -- I can't imagine how Opus 4.5 could've failed that.


codexon 131 days ago

Check the link to the study. It has been updated for Opus 4.5.
