Hacker News
Measuring What Matters: Construct Validity in Large Language Model Benchmarks (arxiv.org)
1 point by Cynddl 223 days ago