by bubblelicious
215 days ago
I work on LLM benchmarks and human evals for a living in a research lab (as opposed to product). I can say: it’s pretty much the Wild West and a total disaster. No one really has a good solution, and researchers are also in a huge rush and don’t want to end up making their whole job benchmarking. Even if you had the right background and did benchmarks full time, they would still be a mess. Product testing (with traditional A/B tests) is kind of the best bet, since you can measure what you care about _directly_ and at scale. There is of course “benchmarketing”, but generally people do sincerely want to make good benchmarks; it’s just hard or impossible. For many of these problems we’re hitting capabilities where we don’t even have a decent paradigm to use.
Ultimately we are measuring extremely measurable things that have an objective ground truth. And yet:
- we completely fail at statistics (the MAJORITY of analysis is literally just "here's the delta in the mean of these two samples". If I ever do see people gesturing at actual proper analysis, if prompted they'll always admit "yeah, well, we do come up with a p-value or a confidence interval, but we're pretty sure the way we calculate it is bullshit")
- the benchmarks are almost never predictive of the performance of real world workloads anyway
- we can obviously always just experiment in prod but then the noise levels are so high that you can entirely miss million-dollar losses. And by the time you get prod data you've already invested at best several engineer-weeks of effort.
AND this is a field where the economic incentives for accurate predictions are enormous.
In AI, you are measuring weird and fuzzy stuff, and you kinda have an incentive to just measure some noise that looks good for your stock price anyway. AND then there's contamination.
Looking at it this way, it would be very surprising if the world of LLM benchmarks was anything but a complete and utter shitshow!
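To make the statistics complaint concrete: instead of reporting only "the delta in the mean of these two samples", one option is a percentile bootstrap interval plus a permutation test. This is a rough stdlib-only sketch, not what any particular team does. The function names, the sample numbers, and the resample counts are all made up for illustration, and it assumes the runs are independent, which real benchmark runs often are not.

```python
import random
from statistics import mean


def bootstrap_diff_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(b) - mean(a).

    Assumes the observations in each sample are independent.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping its original size.
        resampled_a = rng.choices(a, k=len(a))
        resampled_b = rng.choices(b, k=len(b))
        diffs.append(mean(resampled_b) - mean(resampled_a))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the absolute difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(b) - mean(a))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        # Shuffle the group labels and recompute the difference.
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        if abs(mean(perm_b) - mean(perm_a)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Hypothetical scores from two runs; not real data.
    baseline = [0.61, 0.58, 0.64, 0.60, 0.59, 0.63, 0.62, 0.57]
    candidate = [0.63, 0.60, 0.66, 0.59, 0.64, 0.65, 0.61, 0.62]
    print("delta in means:", mean(candidate) - mean(baseline))
    print("95% bootstrap interval:", bootstrap_diff_ci(baseline, candidate))
    print("permutation p-value:", permutation_p_value(baseline, candidate))
```

If the interval comfortably straddles zero, the "delta" on its own is not telling you much. None of this helps with the other two problems (benchmarks not predicting real workloads, and prod noise), and with small or correlated samples the interval can still be misleading.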