Measuring What Matters: Construct Validity in Large Language Model Benchmarks (oxrml.com)
3 points by Cynddl 230 days ago
2 comments

A very large review of AI benchmarks that reveals a worrying trend in their effectiveness and scientific rigor.