	by kirtivr 22 days ago
	Some highly relevant benchmarks, such as Humanity's Last Exam and long-context reasoning (MRCR 128K-256K), are not included. Overall, this seems to be a strong agent-oriented model. Which benchmarks most closely track model coding performance in the real world?