Hacker News
by dev_tools_lab 73 days ago
This is exactly why single-model evaluation is dangerous. Benchmarks are gamed, but disagreement between models is harder to fake. Multi-model consensus catches what individual benchmarks miss.
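
For anyone who wants to try this, here is a rough sketch of what a consensus check could look like. It assumes you already have each model's answer per eval item as a plain string; the names `consensus`, `flag_disagreements`, and the 0.75 threshold are made up for illustration and do not come from any particular eval harness.

```python
from collections import Counter


def consensus(answers, min_agreement=0.75):
    """Return (majority_answer, agreement_ratio, disputed) for one eval item.

    `answers` maps a model name to that model's answer string.
    The 0.75 default threshold is an arbitrary assumption, not a standard.
    """
    if not answers:
        raise ValueError("need at least one model answer")
    counts = Counter(answers.values())
    top_answer, top_count = counts.most_common(1)[0]
    agreement = top_count / len(answers)
    return top_answer, agreement, agreement < min_agreement


def flag_disagreements(items, min_agreement=0.75):
    """Return (item_id, agreement_ratio) for items where models disagree.

    `items` maps an item id to a {model_name: answer} dict.
    """
    flagged = []
    for item_id, answers in items.items():
        _, agreement, disputed = consensus(answers, min_agreement)
        if disputed:
            flagged.append((item_id, agreement))
    return flagged
```

Exact string matching is the crudest possible comparison, so in practice you would probably normalize answers first. The flagged items are the ones worth a human look, since a high benchmark score on its own tells you nothing about them.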