For what it's worth, I work on platforms infra at a hyperscaler, and benchmarks are a complete fucking joke in my field too lol.

Ultimately we are measuring extremely measurable things that have an objective ground truth. And yet:

- We completely fail at statistics. The MAJORITY of analysis is literally just "here's the delta in the mean of these two samples". If I ever do see people gesturing at actual proper analysis, if prompted they'll always admit "yeah, well, we do come up with a p-value or a confidence interval, but we're pretty sure the way we calculate it is bullshit".
- The benchmarks are almost never predictive of the performance of real-world workloads anyway.
- We can obviously always just experiment in prod, but then the noise levels are so high that you can entirely miss million-dollar losses. And by the time you get prod data you've already invested at best several engineer-weeks of effort.

AND this is a field where the economic incentives for accurate predictions are enormous. In AI, you are measuring weird and fuzzy stuff, and you kinda have an incentive to just measure some noise that looks good for your stock price anyway. AND then there's contamination.

Looking at it this way, it would be very surprising if the world of LLM benchmarks was anything but a complete and utter shitshow!

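
For anyone wondering what one step past "here's the delta in the mean of these two samples" might look like, here is a minimal sketch of a percentile bootstrap confidence interval for that delta. Everything in it is an assumption for illustration: the function name `bootstrap_mean_delta_ci`, its parameters, and the latency numbers are made up, and a percentile bootstrap is just one reasonable choice, not what any particular team actually uses. It also assumes the runs are independent, which real benchmark runs on shared hardware often are not.

```python
import random
import statistics


def bootstrap_mean_delta_ci(baseline, candidate, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for mean(candidate) - mean(baseline).

    Hypothetical helper for illustration; assumes independent samples.
    Returns (point_estimate, ci_low, ci_high).
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    deltas = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping the original sizes.
        b = rng.choices(baseline, k=len(baseline))
        c = rng.choices(candidate, k=len(candidate))
        deltas.append(statistics.fmean(c) - statistics.fmean(b))
    deltas.sort()
    # Take the alpha/2 and 1 - alpha/2 empirical quantiles of the resampled deltas.
    lo = deltas[int((alpha / 2) * n_resamples)]
    hi = deltas[int((1 - alpha / 2) * n_resamples) - 1]
    point = statistics.fmean(candidate) - statistics.fmean(baseline)
    return point, lo, hi


# Synthetic latencies in ms (invented data, not real measurements).
data_rng = random.Random(42)
baseline_ms = [data_rng.gauss(100.0, 8.0) for _ in range(30)]
candidate_ms = [data_rng.gauss(102.0, 8.0) for _ in range(30)]

point, lo, hi = bootstrap_mean_delta_ci(baseline_ms, candidate_ms)
print(f"delta in means: {point:.2f} ms, 95% CI: [{lo:.2f}, {hi:.2f}] ms")
```

If the interval comfortably straddles zero, the bare delta in the mean is not telling you much, which is roughly the failure mode described above.
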
Sort of tangential, but as someone currently taking an intro statistics course and wondering why it's all not really clicking given how easy the material is, this for some reason makes me feel a lot better.