| HN Mirror



	by deyiao 583 days ago
	The benchmark results seem unrealistically good, but I'm not sure from which angles I should challenge them.

1 comment

I think they're real. The model is performing better than claude-3-5-sonnet-20241022 on the claude leaderboard: