by bjackman 227 days ago
For what it's worth, I work on platforms infra at a hyperscaler and benchmarks are a complete fucking joke in my field too lol.

Ultimately we are measuring extremely measurable things that have an objective ground truth. And yet:

- we completely fail at statistics (the MAJORITY of analysis is literally just "here's the delta in the mean of these two samples". If I ever do see people gesturing at actual proper analysis, if prompted they'll always admit "yeah, well, we do come up with a p-value or a confidence interval, but we're pretty sure the way we calculate it is bullshit")

- the benchmarks are almost never predictive of the performance of real world workloads anyway

- we can obviously always just experiment in prod but then the noise levels are so high that you can entirely miss million-dollar losses. And by the time you get prod data you've already invested at best several engineer-weeks of effort.
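To make the first bullet concrete: a minimal sketch of going one step beyond "delta in the mean", using a percentile bootstrap to put an interval around that delta. The function name `bootstrap_mean_delta_ci` and the latency numbers are made up for illustration, and this is only one of several reasonable ways to do it, not a description of what any particular team runs.

```python
import random
import statistics


def bootstrap_mean_delta_ci(baseline, candidate, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for mean(candidate) - mean(baseline)."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    deltas = []
    for _ in range(n_resamples):
        # Resample each group independently, with replacement.
        b = rng.choices(baseline, k=len(baseline))
        c = rng.choices(candidate, k=len(candidate))
        deltas.append(statistics.fmean(c) - statistics.fmean(b))
    deltas.sort()
    # Take the alpha/2 and 1 - alpha/2 percentiles of the resampled deltas.
    lo = deltas[int((alpha / 2) * n_resamples)]
    hi = deltas[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical latency samples in milliseconds (invented for this example).
baseline = [10.1, 10.4, 9.8, 10.9, 10.2, 11.5, 9.9, 10.3, 10.0, 10.6]
candidate = [10.0, 10.8, 9.7, 11.2, 10.1, 11.9, 10.2, 10.5, 9.9, 10.7]

point = statistics.fmean(candidate) - statistics.fmean(baseline)
low, high = bootstrap_mean_delta_ci(baseline, candidate)
print(f"delta in means: {point:.3f} ms, 95% interval: [{low:.3f}, {high:.3f}]")
```

If the interval comfortably straddles zero, the bare delta is not telling you much. Note the sketch assumes the samples are independent, which real benchmark runs on shared hardware often are not.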

AND this is a field where the economic incentives for accurate predictions are enormous.

In AI, you are measuring weird and fuzzy stuff, and you kinda have an incentive to just measure some noise that looks good for your stock price anyway. AND then there's contamination.

Looking at it this way, it would be very surprising if the world of LLM benchmarks was anything but a complete and utter shitshow!

5 comments

> we completely fail at statistics (the MAJORITY of analysis is literally just "here's the delta in the mean of these two samples". If I ever do see people gesturing at actual proper analysis, if prompted they'll always admit "yeah, well, we do come up with a p-value or a confidence interval, but we're pretty sure the way we calculate it is bullshit")

Sort of tangential, but as someone currently taking an intro statistics course and wondering why it's all not really clicking given how easy the material is, this for some reason makes me feel a lot better.

FWIW, I don't think intro stats is easy the way I normally see it taught. It focuses on formulae, tests, and step-by-step recipes without spending the time to properly develop intuition as to why those work, how they work, which ones you should use in unfamiliar scenarios, how you might find the right thing to do in unfamiliar scenarios, etc.

Pair that with skipping all the important problems (what is randomness, how do you formulate the right questions, how do you set up an experiment capable of collecting data which can actually answer those questions, etc), and it's a recipe for disaster.

It's just an exercise in box-ticking, and some students get lucky with an exceptional teacher, and others are independently able to develop the right instincts when they enter the class with the right background, but it's a disservice to almost everyone else.

I found the same when I was taking intro to stats - I did get a much better intuition for what stuff meant after reading 'Superforecasting' by Tetlock and Gardner - I find I'm recommending that book a lot, come to think of it.
“Here’s the throughput at sustained 100% load with the same ten sample queries repeated over and over.”

“The customers want lower latency at 30% load for unique queries.”

“Err… we can scale up for more throughput!”

ಠ_ಠ

And then when you ask if they disabled the query result cache before running their benchmarking, they blink and look confused.
Then you see a 25% cache hit rate in production and realise that disabling it for the benchmark is not a good option either.
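One middle ground between "same ten queries" and "cache disabled" is to shape the benchmark workload so its repeat rate roughly matches production. A rough sketch, assuming a plain LRU result cache and treating the 25% figure above as the target; `make_query_stream` and `measure_hit_rate` are hypothetical helpers, not part of any real benchmarking tool:

```python
import random
from collections import OrderedDict


def make_query_stream(n_queries, repeat_fraction, seed=0):
    """Yield query IDs where roughly `repeat_fraction` repeat an earlier query."""
    rng = random.Random(seed)
    seen = []
    next_id = 0
    for _ in range(n_queries):
        if seen and rng.random() < repeat_fraction:
            yield rng.choice(seen)  # repeat of an earlier query: may hit the cache
        else:
            seen.append(next_id)  # never-seen query: always a cache miss
            yield next_id
            next_id += 1


def measure_hit_rate(stream, cache_size):
    """Replay the stream against a simple LRU cache and return the hit rate."""
    cache = OrderedDict()
    hits = total = 0
    for q in stream:
        total += 1
        if q in cache:
            hits += 1
            cache.move_to_end(q)  # mark as most recently used
        else:
            cache[q] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits / total


rate = measure_hit_rate(make_query_stream(20_000, repeat_fraction=0.25), cache_size=100_000)
print(f"simulated cache hit rate: {rate:.1%}")
```

With a cache larger than the stream, the simulated hit rate lands close to `repeat_fraction`; a smaller `cache_size` pulls it lower, so you would tune both against what production actually shows.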
In AI though, you also have the world trying to compete with you. So even if you do totally cheat, put the benchmark answers in your training set and overfit, it doesn't matter how much your marketing department tells everyone you scored 110% on SWE-bench: if your model sucks and doesn't work that well in production, your announcement's going to flop as users discover it doesn't work that well on their personal/internal secret benchmarks and tell /r/localLLAMA it isn't worth the download.

Whatever happened with Llama 4?

Even a p-value is insufficient. Maybe we can use some of this stuff: https://web.stanford.edu/~swager/causal_inf_book.pdf
I have actually been thinking of hiring some training contractors to come in and teach people the basics of applied statistical inference. I think with a bit of internal selling, engineers would generally be interested enough to show up and pay attention. And I don't think we need very deep expertise, just a moderate bump in the ambient level of statistical awareness would probably go a long way.

It's not like there's a shortage of skills in this area, it seems like our one specific industry just has a weird blindspot.

Don’t most computer science programs require this? Mine had a statistics requirement.
I don't know how it is in the US and other countries, but in my country I would say statistics is typically not taught well, at least in CS degrees. I was a very good student and always had a good understanding of the subjects at university, but in the case of statistics they just taught us formulae and techniques as dogma, without much explanation of where they came from, why, and when to use them. It didn't help either that the exercises we did always applied them to things outside CS (clinical testing, people's heights and things like that) with no application we could directly relate to. As a result, when I finished the degree I had forgotten most of it, and when I started working I was surprised that it was actually useful.

When I talk about this with other CS people in my own country (Spain) they tend to report similar experiences.

I had the same experience in the US
I'd say your experience is being more monetized for growth for growth's sake.
Actually I disagree that that's what's going on in the world of hyperscaler platforms. There is genuinely a staggering amount of money on the line with the efficiency of this platform. Plus, we have extremely sophisticated and performance-sensitive customers who are directly and continuously comparing us with our competitors.

This isn't just that nobody cares about the truth. People 100% care! If you actually degrade a performance metric as measured post-hoc in full prod, someone will 100% notice, and if you want to keep your feature un-rolled-back, you are probably gonna have to have a meeting with someone that has thousands of reports, and persuade them it's worth it to the business.

But you're always gonna have more luck if you can have that meeting _before_ you degrade it. But... it's usually pretty hard to figure out what the exact degradation is gonna be, because of the things in my previous comment...