HN Mirror



	by fancy_pantser 111 days ago
	With the approach described, it's looking rather weak on reasoning and long-range problems. For example, even with 16 agents and compaction, the HLE score is significantly below that of Anthropic's Mythos. Like you, I can see the release as a net Good Thing, but an apples-to-apples comparison of each org's latest models does have Meta holding steady in the middle of the pack.

1 comment

zozbot234 111 days ago

HLE encompasses very hard problems where the larger pretraining of Mythos probably matters quite a bit. I'm not saying that Mythos is not showing some amount of genuine improvement compared to e.g. the latest Opus; just that if you're going to compare models, you should at least make sure that the overall test-time workload is in the same ballpark given how high it seems to be for Mythos.
