Hacker News
by snemvalts 15 days ago
Most benchmarks can be trained for as well, so they overstate a model's engineering skills. The entire nature of a benchmark is to collapse qualitative work (a software engineering task, an architecture choice, code quality) into a quantitative score, and any such score can be optimized for.