
	by aspenmartin 50 days ago
	Yes, but there’s a reason we don’t evaluate these models this way and instead do it as carefully and thoughtfully as we can at scale. Human evaluations are important but they are an absolute minefield of footguns. 4.8 is not a downgrade from 4.6; there is an insane amount of hard data contradicting that claim.

5 comments

computerex 50 days ago

The flip side is that benchmarks are gamed even by the top labs. Benchmark performance doesn't necessarily correlate with real world performance.

aspenmartin 50 days ago

Again, correct, but it overstates the issue. I can say labs don’t want this. This happened, arguably unintentionally, in Meta’s Llama 4 release: it went horribly, heads rolled, several billion dollars were paid for new talent, and the org that built Llama 4 was destroyed.

Evals come from a million places and new evals and robust perturbations of existing evals abound. They test a variety of tasks in a variety of ways. All of them individually are flawed. Taken together the aggregate signal is highly useful as you more or less marginalize over a lot of different things. Not to mention these companies have plenty of proprietary internal measurements, they build benchmarks themselves to probe their models and then also have flywheel traffic and A/B tests.
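To make the "aggregate signal" point concrete, here is a minimal simulation sketch. It assumes each benchmark reports the true score difference between two models plus its own independent, zero-mean error; that independence is an assumption of the sketch, not something the thread establishes, and averaging would not cancel an error shared by every benchmark. The function names and numbers are hypothetical.

```python
import random
import statistics


def simulate_benchmark_deltas(true_delta, n_benchmarks, noise_sd, seed=0):
    """Return one noisy estimate of the model-to-model score difference per benchmark.

    Assumption: every benchmark's error is independent and zero-mean.
    """
    rng = random.Random(seed)
    return [true_delta + rng.gauss(0.0, noise_sd) for _ in range(n_benchmarks)]


def aggregate(deltas):
    """Average the per-benchmark differences and report the standard error of that mean."""
    mean = statistics.fmean(deltas)
    standard_error = statistics.stdev(deltas) / len(deltas) ** 0.5
    return mean, standard_error


if __name__ == "__main__":
    deltas = simulate_benchmark_deltas(true_delta=2.0, n_benchmarks=400, noise_sd=5.0, seed=1)
    mean, standard_error = aggregate(deltas)
    print(f"single-benchmark spread: {statistics.stdev(deltas):.2f}")
    print(f"aggregate estimate: {mean:.2f} +/- {standard_error:.2f}")
```

Under these assumptions any single benchmark is off by several points, while the standard error of the average shrinks with the square root of the number of benchmarks.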

You are right to call out benchmarks but to dismiss them or not take them seriously is a mistake.

taormina 50 days ago

Listen, you can say “but benchmarks, the benchmarks!” all day long, but consumers know when we are being sold a lemon. If it can’t do the most basic of things at least as good as it used to, this is table stakes. Never mind that if you can’t do the basic stuff, how on earth can you be trusted with more?

aspenmartin 50 days ago

And you can say “If it can’t do the most basic of things at least as good as it used to, this is table stakes” all day long while people point you to much better evidence to the contrary too. I’d rather be on the other side of that.

taormina 50 days ago

Listen. I don’t care about evidence. I care about my lived experience for the product I paid for. I used the new product. It’s actively terrible. To the point of not being usable. We’re all anecdata, but what is “better evidence to the contrary”? The known and game-able benchmarks that they know they need to win at, so they train it to. It’s all he said, she said, which is the only reason we keep having this conversation.

aspenmartin 50 days ago

Yeah, but it’s not, right? You or I or the myriad of other institutions inside and outside of academia can probe these models with an evolving landscape of evaluation sets, even those unavailable to the developers. It’s just ignorance to claim benchmarks are somehow useless or all being gamed. You choose your tools in the way you want, but just don’t call it somehow better than a myriad of more carefully constructed setups and scaled evaluations.

gen220 50 days ago

Actually anecdata I gather on my job from myself and coworkers is the only benchmark I trust anymore, because it so heavily diverges from the “benchmarks”.

aspenmartin 50 days ago

That’s your call, just don’t expect anyone ever to take that seriously. It’s not like we don’t have evaluations exactly like this.

gen220 50 days ago

I would encourage you to look into the open evals of some of these benchmarks (find one that actually is open-data, this is itself a good challenge), read the results generated and assess them for yourself.

This is what myself and my coworkers (and many other people in this thread) are doing on a daily basis with real stakes and real tasks – which these benchmarks are all aiming to be a proxy for. There's a real, tangible [cost]benefit to [not] using the highest-ROI models and harnesses.

The people with real incentives and skin in the game are telling you that the data diverges from "the data".

I don't mind if you don't take it seriously, our jobs are more important to us than a benchmark is.

But I wouldn't opt out of using your own eyes and the eyes of others so easily, especially when there are literally hundreds of billions of dollars in invested capital with an interest in a certain outcome... this is how you end up in "Emperor's New Clothes" situations.

aspenmartin 49 days ago

Investigating your specific use cases, codebases, workflows and tasks is important. There is nothing wrong with this, and in fact it’s more important than benchmarks if you can do it well, but the point is that it is very hard and it is easy to totally fool yourself and go down a suboptimal path. I understand that people are going to do it regardless; I certainly do. And I have looked at more raw benchmark data than I can really even stomach; I can see annotation data in my dreams now.

The eyes and ears of others are incredibly important. But you still seem to think benchmarks are somehow part of some giant conspiratorial cabal. You have institutions without ANY skin in the game making extremely high quality benchmarks. Consider that in academia there is little else to do outside of partnerships with these companies. But benchmarks you can do completely independently and with university-grant-level money (it costs maybe $10-100k for a reasonable benchmark in many cases). Not only that, “real tasks” are what many benchmarks measure. You have these companies with extremely good logging and well scaled measurements to really look at what works and what doesn’t.

gen220 49 days ago

At this point I have a workflow that is fairly rote. I've yet to use a model newer than 4.6-1M-XHIGH that I trust to earn a higher ROI on that workflow, and not for lack of trying!

I personally don't believe in any sort of cabal (Occam's Razor hasn't let me down yet). Ultimately, I don't really care *why* they're wrong as much as I care *that* they have diverged from my rubber-meets-the-road measures of value.

That is concerning to me, because people are investing 100s of B's of capital based on the putative ROI putatively available to people like ourselves. When the benchmarks support this ROI thesis, but none of the anecdata does... that's really concerning!

Re: academics, I don't think any of the data academics have access to are good proxies for the work real people are doing. And for the data that are good proxies, the model labs certainly have access to the same data, and therefore the benchmark performance against those data is irrelevant.

aspenmartin 49 days ago

I am in full support of custom workflow benchmarks, and choosing the best model for your use case to balance performance and expense. That’s just good operating behavior, but the problem is the footguns and biases people have that they are convinced they don’t have, even if they understand on an intellectual level that everyone else has them.

> but none of the anecdata does... that's really concerning!

But see, this is not really true -- adoption, subjective benchmarks, verifiable benchmarks, task-dependent performance, internal product metrics, living benchmarks, all point in a pretty consistent direction. The plural of anecdote is not data. An anecdote is like a case study. It's there to motivate the things we already have, which is a huge number of performance measures for a variety of different tasks.

> Re: academics, I don't think any of the data academics have access to are good proxies for the work real people are doing.

But this isn't really true either -- you can get this data from a variety of sources that are licensable or open source, or data that you can commission. You can critique any one methodology for this but a blanket "they are hamstrung" is not really fair or accurate.

> And for the data that are good proxies, the model labs certainly have access to the same data, and therefore the benchmark performance against those data is irrelevant.

But this is also not true -- you can have exclusive license agreements, data you hold close to the heart, or data to measure models that haven't had access to it because that data was created after these models were released.

There are plenty of problems in model measurement but the answer is not to just abandon it to be cavemen with zero respect for rigor and the biases we have to be subject to as human beings.

recitedropper 50 days ago

"Carefully and thoughtfully" is antithetical to the approach to benchmarks these days.

Maybe back when this was a scientific endeavor; not now when enormous, enormous amounts of capital are on the line. Along with an entire cult's chosen eschatology.

aspenmartin 50 days ago

You can call it a cult but it’s several thousand skilled workers who know what they’re doing, by and large, most of whom have a PhD and know how science and statistics work. Benchmarks are incredibly hard, and any PR or comms department at any company is going to obviously want to make things as rosy as possible, but beneath this are earnest, expensive efforts to get good quality measurements. The better you can do this the better you can compete. If you want to make a modeling decision you run an ablation, and the quality of that decision is only as good as your measurements.

recitedropper 50 days ago

The cult in this case is TESCREAL, not everyone working on AI. Last I checked not all the "several thousand skilled workers" in AI subscribe to TESCREAL ideology, although it has been a while since I've been to the Bay. Maybe things have changed since my time at Berkeley, and Dario's belief that he will eventually be made immortal by mind uploading is more widespread.

Otherwise we agree that benchmarking is hard, the benchmarks contain hard problems, and that there are many hard working people trying to accurately gauge what is going on. It is getting harder to watch though as all that is on the line taints the overall endeavor.

OtomotO 50 days ago

There is no data that I would trust that contradicts it.

Frankly I don't give a damn about data that could be made up on the spot or appears to be scientific or meaningful while it's not at all clear how it was made (up).

Claude was heavily lobotomised for my work starting sometime in February.

I talked to friends and people I know and trust and many felt the same. (I didn't ask them whether they felt like I did, but what they felt, how happy they were with agentic coding etc.)

I cancelled my subscription in March and talked to said friends who are still on a plan just last week: they are still not happy, but the company pays, so whatever...

aspenmartin 50 days ago

That’s ok but at what point is this getting into conspiracy territory? You have just said there is nothing you would believe to the contrary, but then by definition that’s not exactly a very thoughtful or insightful position.

OtomotO 49 days ago

I never said that I am not willing to believe the contrary.

I am not willing to believe the contrary from strangers on the interwebs or PR departments of companies who want to sell me something.

If people I genuinely trust tell me about their experiences, I am willing to try again.

But yes, if it doesn't work for me (for whatever reason, could be that I am holding it wrong), then I can accept that it works for everyone but me and still not use it.

Also "scientific" doesn't mean what it used to mean. When the n is small or it's just anecdotes (I am aware of the irony) blown out of proportion, I really can't take the data and conclusions seriously.

aspenmartin 49 days ago

N isn’t small, science means what it’s always meant, statistics is a thing, and what you’re describing is just putting your trust in a very poor quality benchmark. You said you would not trust any data that indicates something that contradicts your opinion. Benchmarks are not PR they are designed by a variety of institutions completely outside the control of frontier labs. Again congratulations on your conspiracy theory.

OtomotO 49 days ago

> Again congratulations on your conspiracy theory.

I am neither impressed nor offended by any kind of argumentum ad hominem. I sincerely hope you have a wonderful day!

> Benchmarks are not PR they are designed by a variety of institutions completely outside the control of frontier labs.

I don't give a crap about how good a shovel may be in a theoretical experiment when it's digging in sand, when I work with hard earth.

The ones I had a look at are mostly absolutely meaningless to my actual work.

> and what you’re describing is just putting your trust in a very poor quality benchmark.

And here is where we disagree fundamentally, so we can leave it at that.

Ex falso quodlibet

aspenmartin 49 days ago

> I don't give a crap about how good a shovel may be in a theoretical experiment when it's digging in sand, when I work with hard earth.

I don't know what this means; benchmark tasks are pretty hard and pretty in-domain.

> The ones I had a look at are mostly absolutely meaningless to my actual work.

You've looked at 100,000 benchmarks?

> And here is where we disagree fundamentally, so we can leave it at that.

Yes we do disagree, yet one of us has statistics and rigor and one of us doesn't.

pythonaut_16 50 days ago

Seems like a bunch of noise. What does this even mean?

It sounds like you're saying "Actually you, as a human, are simply not smart enough to evaluate Opus 4.8"

aspenmartin 50 days ago

No, it’s that evaluating these systems is complex, and there’s a reason why studies in sociology, cognitive psychology, medicine, etc. are done under careful double-blind conditions with preregistered tests. It’s not that humans are not smart enough; as I said, human evaluations are incredibly important. And yet they are a minefield of biases you have to worry about and correct for.

- evaluations need to be done at the same time to avoid drift in your bias

- you need to worry about your test set: which questions are you asking? How many of them? Are they representative of your work?

- which one did you do first? Raters have a tendency to bias in one direction or another

- you also know the label! You know which model is which! This biases your assessment…

And on and on and on. Careful science exists for a reason.
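As a rough illustration of how the biases listed above can be controlled in a personal side-by-side test, here is a minimal sketch. It assumes you already have two models' outputs for a fixed set of prompts chosen in advance, and that a rater (you, or a callable standing in for you) picks the better of two unlabeled outputs in a single session. All names (`make_blinded_trials`, `run_blinded`, `sign_test_p_value`, the model keys "x" and "y") are hypothetical, and the exact sign test is just one reasonable choice of analysis.

```python
import math
import random


def make_blinded_trials(prompts, outputs_x, outputs_y, seed=0):
    """Build trials with hidden model labels, random display order, and shuffled prompt order."""
    rng = random.Random(seed)
    trials = []
    for prompt, out_x, out_y in zip(prompts, outputs_x, outputs_y):
        # Randomize which model is shown first so position bias cannot favor one model.
        if rng.random() < 0.5:
            trials.append({"prompt": prompt, "first": out_x, "second": out_y, "key": ("x", "y")})
        else:
            trials.append({"prompt": prompt, "first": out_y, "second": out_x, "key": ("y", "x")})
    rng.shuffle(trials)
    return trials


def run_blinded(trials, rater):
    """Collect preferences. rater(prompt, first, second) returns 0 or 1 and never sees the key."""
    wins = {"x": 0, "y": 0}
    for trial in trials:
        choice = rater(trial["prompt"], trial["first"], trial["second"])
        wins[trial["key"][choice]] += 1  # unblind only after the choice is recorded
    return wins


def sign_test_p_value(wins_x, wins_y):
    """Two-sided exact binomial (sign) test against the null of no preference."""
    n = wins_x + wins_y
    k = max(wins_x, wins_y)
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Rating every trial in one sitting addresses drift, fixing the prompt list up front addresses cherry-picked test sets, and the randomized order plus the hidden `key` address ordering and label bias. It does not fix an unrepresentative or too-small prompt set.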
