Hacker News
by dev_tools_lab 73 days ago
This is exactly why single-model evaluation is dangerous. Benchmarks are gamed, but disagreement between models is harder to fake. Multi-model consensus catches what individual benchmarks miss.
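
A rough sketch of what a disagreement check could look like, assuming you already have each model's answer per prompt as a normalized string. The function names, the `outputs` dict, and the 0.6 threshold are all made up for illustration, not taken from any particular eval framework.

```python
from collections import Counter


def agreement(answers):
    """Return (majority_answer, fraction of models that gave it)."""
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)


def flag_disagreements(outputs, threshold=0.6):
    """List prompt ids where the majority share falls below the threshold.

    `outputs` maps a prompt id to the list of answers, one per model.
    The threshold is an arbitrary illustrative value.
    """
    flagged = []
    for prompt_id, answers in outputs.items():
        _, share = agreement(answers)
        if share < threshold:
            flagged.append(prompt_id)
    return flagged


# Hypothetical answers from three models on three prompts.
outputs = {
    "q1": ["A", "A", "A"],  # full agreement
    "q2": ["A", "B", "C"],  # no majority at all
    "q3": ["A", "A", "B"],  # two of three agree
}
flagged = flag_disagreements(outputs)
```

Flagged prompts are the ones worth a manual look. Agreement by itself doesn't prove correctness, since models can share the same blind spot.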