HN Mirror



	by dev_tools_lab 73 days ago
	This is exactly why single-model evaluation is dangerous. Benchmarks are gamed, but disagreement between models is harder to fake. Multi-model consensus catches what individual benchmarks miss.
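	For anyone who wants to try this, here is a rough sketch of what a disagreement check could look like. It is not taken from any particular eval harness: the function names, the `responses` layout (prompt id mapped to one answer per model), and the 0.6 threshold are all made up for illustration, and it assumes answers can be compared by exact string match.

```python
from collections import Counter


def consensus(answers):
    """Return (majority_answer, agreement_fraction) for one prompt.

    `answers` holds one answer per model (hypothetical layout).
    """
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)


def flag_disagreements(responses, threshold=0.6):
    """Return prompt ids where the models agree less than `threshold`.

    `responses` maps prompt id -> list of answers, one per model.
    The threshold is an arbitrary illustrative value.
    """
    flagged = []
    for prompt_id, answers in responses.items():
        _, agreement = consensus(answers)
        if agreement < threshold:
            flagged.append(prompt_id)
    return flagged


if __name__ == "__main__":
    # Toy data: three models answering three prompts.
    responses = {
        "q1": ["A", "A", "A"],
        "q2": ["A", "B", "C"],
        "q3": ["A", "A", "B"],
    }
    print(flag_disagreements(responses))
```

	The flagged prompts are the ones worth a manual look. Low agreement does not say which model is wrong, only that a single benchmark score on that item should not be trusted on its own.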