| HN Mirror


by anorwell 123 days ago

A pastime I have with papers like this is to look for the part in the paper where they say which models they tested. Very often, you find either A) it's a model from one or more years ago, only just being published now, or B) they don't even say which model they are using. Best I could find in this paper:

> We evaluated 11 user-facing production LLMs: four proprietary models from OpenAI, Anthropic, and Google; and seven open-weight models from Meta, Qwen, DeepSeek, and Mistral.

(and graphs include model _sizes_, but not versions, for open weight models only.)

I can't apprehend how including what model you are testing is not commonly understood to be a basic requirement.

11 comments

dns_snek 123 days ago

And how is this comment relevant here? The abstract lists the digestible model names, and you can find the details in the supplementary text:

> To evaluate user-facing production LLMs, we studied four proprietary models: OpenAI’s GPT-5 and GPT-4o (80), Google’s Gemini-1.5-Flash (81) and Anthropic’s Claude Sonnet 3.7 (82); and seven open-weight models: Meta’s Llama-3-8B-Instruct, Llama-4-Scout-17B-16E, and Llama-3.3-70B-Instruct-Turbo (83, 84); Mistral AI’s Mistral-7B-Instruct-v0.3 (85) and Mistral-Small-24B-Instruct-2501 (86); DeepSeek-V3 (87); and Qwen2.5-7B-Instruct-Turbo (88).

edit: It looks like OP attached the wrong link to the paper!

The article is about this Stanford study: https://www.science.org/doi/10.1126/science.aec8352

But the link in OP's post points to (what seems to be) a completely unrelated study.

vorticalbox 122 days ago

"OpenAI’s GPT-5" is ambiguous. Does that mean GPT-5, 5.1, 5.2, 5.3, or 5.4? Does it include the full model, or the nano/mini variants?

dns_snek 122 days ago

GPT-5 is not ambiguous, it's the official name of the model that was released in August last year.

> All evaluations were done in March - August 2025.

vorticalbox 122 days ago

While true, all the others got precise identifiers, but for OpenAI it makes it hard to reproduce because I have no idea "which" GPT-5 was used.

gardenerik 121 days ago

It was called just GPT-5 at that point in time.

prjkt 122 days ago

In that case, what tokenizer version? What was the temperature set to? topk? topp? FP32? FP16? Quantized? Hopper? Blackwell?

zjp 123 days ago

Also, nothing has changed! Claude will still yes-and whatever you give it. ChatGPT still has its insufferable personality, where it takes what you said and hands it back to you in different terms as if it's ChatGPT's insight.

Terretta 122 days ago

OTOH, for Claude the study says 39% yessy, same as humans, 2nd lowest yessing of the LLMs; GPT5 above 50% yessy.

emp17344 122 days ago

No dude, you don’t understand! It’s just so advanced now that you aren’t allowed to levy any criticism whatsoever!

TrainedMonkey 123 days ago

It's almost like it is based on the training data and regimen that is largely the same between versions.

dryarzeg 122 days ago

Well yes, but no. There are also open-weight models, and literally all of those listed above are not used anymore, at least by most end users and developers as far as I'm aware.

edgyquant 122 days ago

No study of AI can ever be done or be relevant because every couple of months they add a new number to the name of the model, thus invalidating all work around model behavior

dryarzeg 120 days ago

Yes, you are right. Sorry, I missed that out. It's just that all the open-weight models mentioned were... One year old or older. I just forgot that, firstly, such research is rarely done on frontier models because it takes time (you start with Llama 3.3, but look, one month later there's Llama 4), and secondly, there's also a publication delay. I think I'm just too used to the world of software, where everything moves at lightning speed. Sorry : )

latexr 122 days ago

> A pastime I have with papers like this is to look for the part in the paper where they say which models they tested.

My pastime (not really) in HN submissions like this is to look for the comment where someone complains about the models used because they aren’t the literal same model and version the commenter has started using the day before.

It’s always “you can’t test with those models, those are crap, the ones we have now are much better”, in perpetuity. It’s Schrödinger’s LLM: simultaneously god-like and a piece of garbage depending on the needs of the discussion. It’s beyond moving the goalposts, it’s moving the entire football field. It’s a clear bad faith attempt to try to discredit any study the commenter doesn’t like. Which you can always do because you can’t test literally everything.

zelphirkalt 122 days ago

The GP's criticism as I read it is about paper authors not making it particularly easy to reproduce their findings.

For a long time I have criticized this too, especially for software projects, or papers that deal with machine learning models. If the things described in a paper are not reproducible, then it's basically worthless. Similar to "it works on my machine" in software engineering. Many paper authors are not software engineers, and often neither are they experts in the tooling they should be using to make their research reproducible. If this is a problem for a research team, then please, hire an engineer to ensure reproducibility. It doesn't help anyone to remain ignorant towards the reproducibility issue and only shows a lack of scientific discipline. Reproducibility should be on the mind of any serious researcher and there should be lectures about how to do it at universities.

DrewADesign 122 days ago

Firing off glib criticism that amounts to “No study on AI is valid beyond the release cycle of the models tested,” feels like the unconscious self-protection reflex we all default to when facing cognitive dissonance. It seems like it’s only easy to spot when someone you disagree with is doing it.

To me, it almost feels like a partisan political thing.

zulban 123 days ago

Generally, published papers don't give a damn about reproducibility. I've seen it identified as a crisis by many. Publishers, reviewers, and researchers mostly don't care about that level of basic rigor. There are no professional repercussions or embarrassment.

Agreed - if I was a reviewer for LLM papers it would be an instant rejection not listing the versions and prompts used.
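To make "listing the versions and prompts used" concrete, here is a minimal sketch of the kind of run record a paper could ship alongside its results. Everything in it is an assumption for illustration: the `RunManifest` fields, the `make_manifest` helper, and the example model identifier, date, and prompt are hypothetical and not taken from the study.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunManifest:
    """Hypothetical record of what was actually evaluated in one run."""
    model_id: str       # exact identifier string sent to the provider
    evaluated_on: str   # ISO date on which the calls were made
    temperature: float  # sampling temperature used
    top_p: float        # nucleus sampling cutoff used
    prompt_sha256: str  # hash of the exact prompt text


def make_manifest(model_id: str, evaluated_on: str, temperature: float,
                  top_p: float, prompt: str) -> RunManifest:
    # Hash the prompt so the record pins the exact wording without
    # having to embed a long prompt in every results file.
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return RunManifest(model_id, evaluated_on, temperature, top_p, digest)


# All values below are made up for illustration.
manifest = make_manifest(
    model_id="example-model-2025-01-01",
    evaluated_on="2025-03-15",
    temperature=0.7,
    top_p=1.0,
    prompt="Was I wrong to skip my friend's wedding?",
)
manifest_json = json.dumps(asdict(manifest), indent=2, sort_keys=True)
print(manifest_json)
```

A record like this does not make a deprecated model available again, but it does let a reader tell exactly which model and settings a number in the paper refers to.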

epistasis 123 days ago

I'm not so sure of that opinion on reproducibility. The last peer review I did was for a small journal that explicitly does not evaluate for high scientific significance, merely for correctness, which generally means straightforward acceptance. The other two reviews were positive, as was mine, except I said that the methods need to be described more and ideally the code placed somewhere. That was enough for a complete rejection of the paper, without asking for the simple revisions I requested. It was a very serious action taken merely because I requested better reproducibility!

(Personally I think the lack of reproducibility comes back mostly to peer reviewers that haven't thought through enough about the steps they'd need to take to reproduce, and instead focus on the results...)

zulban 122 days ago

I'm not sure how one example contradicts documented huge overall trends, but okay.

epistasis 122 days ago

I think publishers care about this a lot, but most researchers do not seem to care as much about reproducibility.

catlifeonmars 123 days ago

> and instead focus on the results...

This points to (and everyone knows this) incentives misalignment between the funders of research and the public. Researchers are caught in the middle

epistasis 122 days ago

Eh, I'm not so sure about the funding side there, researchers are not really caught at all and are fully responsible, IMHO. Peer reviewers exist to enforce community standards, and are not influenced to avoid reproducibility concerns by funding sources. The results are always more interesting than reproducibility, of course, and I think that's why they get the attention! Also, there needs to be greater involvement of grad students (who do most of the actual work) in peer review, IMHO, because most PIs spend their day in meetings reviewing results, setting directions, writing grants, and have little time for actual lab work, and are thus disconnected from it.

There needs to be more public naming and shaming in science social media and in conference talks, but especially when there are social gatherings at conferences and people are able to gossip. There was a bit of this with Google's various papers, as they got away with figurative murder on lack of reproducibility for commercial purposes. But eventually Google did share more.

Most journals have standards for depositing expensive datasets, but that's a clear yes/no answer. Reproducibility is a very subjective question in comparison to data deposition, and must be subjectively evaluated by peer reviewers. I'd like to see more peer review guidelines with explicit check boxes for various aspects of reproducibility.

catlifeonmars 122 days ago

> Reproducibility is a very subjective question in comparison to data deposition

Yeah I can definitely see why this is the case because it isn’t real until someone actually tries to reproduce the results. At that point it leaves the realm of subjectivity and becomes a question of cost.

bjourne 122 days ago

The comment is wrong -- model versions are clearly specified in the supplement.

ghywertelling 123 days ago

The same about surveys and polls. I know no one who has ever been polled or surveyed. When will we stop this fascination with made up infographics crisis?

inetknght 122 days ago

> Generally, published papers don't give a damn about reproducibility

While this is sadly true, it's especially true when talking about things that are stochastic in nature.

LLMs outputs, for example, are notoriously unreproducible.

zulban 122 days ago

> LLMs outputs, for example, are notoriously unreproducible.

Only in the same way that an individual in a medical study cannot be "reproduced" for the next study. However the overall statistical outcomes of studying a specific LLM can be reproduced.
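A rough sketch of that point, as a simulation rather than a real experiment: individual responses are random, but the aggregate rate from two independent runs lands in the same place. The numbers are made up (2000 prompts, an underlying affirmation rate of 0.5), and the function names are hypothetical; no actual model is called.

```python
import math
import random
from typing import Tuple


def affirmation_rate(n_prompts: int, true_rate: float, seed: int) -> float:
    """Simulate one study run: each response affirms with probability true_rate."""
    rng = random.Random(seed)
    affirmed = sum(rng.random() < true_rate for _ in range(n_prompts))
    return affirmed / n_prompts


def wald_interval(rate: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% confidence interval for a proportion (normal approximation)."""
    half_width = z * math.sqrt(rate * (1 - rate) / n)
    return rate - half_width, rate + half_width


# Hypothetical study size and underlying rate, chosen only for illustration.
N, TRUE_RATE = 2000, 0.5

# Two independent "studies" of the same simulated model, with different seeds,
# so the per-prompt outcomes differ between the runs.
run_a = affirmation_rate(N, TRUE_RATE, seed=1)
run_b = affirmation_rate(N, TRUE_RATE, seed=2)
low, high = wald_interval(run_a, N)

print(f"run A: {run_a:.3f}  run B: {run_b:.3f}  95% CI for A: [{low:.3f}, {high:.3f}]")
```

With a sample this size the interval is only a couple of percentage points wide on each side, which is why a replication can be expected to land close to the original estimate even though no single response is repeatable.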

KellyCriterion 123 days ago

Do they reproduce any submitted papers at all?

Does this happen?

I can remember this room-temperature-super-conductor guy whose experiments were replicated, but this seems rare?

linhns 123 days ago

Yes, those are the only papers that are worth a jot of reading.

jameshart 122 days ago

I think it’s very important to be clear what studies like this are actually doing.

This study, although it has been produced by a computer science department, belongs more to the field of sociology or media studies than it does to computer science.

This is a study about the way in which human beings consume a particular media product - a consumer AI chatbot - not a study about the technological limitations or capabilities of LLMs.

The social impact of particular pieces of software is a legitimate field of study and I can see the argument that it belongs in the broadly defined field of computer science. But this sort of question is much more similar to ‘how does the adoption of spreadsheet software in finance impact the ease of committing fraud’ or ‘how does the use of presentation software to condense ideas down to bulletpoints impact organizational decision making’. Software has a social dimension and it needs to be examined.

But the question of which models were used is of much less relevance to such a study than that they used ‘whatever capability is currently offered to consumers who commonly use chat software’. Just like in a media studies investigation into how viewing cop dramas impacts jury verdicts, the question of ‘which cop dramas did they pick to study?’ matters less, so long as the ones they picked were representative of what typical viewers see.

yacin 123 days ago

Any paper like this would easily take a year or more to write and go through the submission/review/rebuttal/revision/acceptance process. I don't understand why the models being a year or two old now is worth noting as though it's a clear weakness? What should they do, publish sub-standard results more quickly?

anorwell 122 days ago

> I don't understand why the models being a year or two old now is worth noting as though it's a clear weakness?

I do think it's a clear weakness. Capabilities are extremely different than they were twelve months ago.

> What should they do, publish sub-standard results more quickly?

Ideally, publish quality results more quickly.

I'm quite open to competing viewpoints here, but it's my impression that the academic publishing cycle isn't really contributing to the AI discussion in a substantive way. The landscape is just moving too quickly.

yacin 122 days ago

The onus is on you to prove or at least convincingly argue that the results are unlikely to generalize across incremental model releases. In my personal experience, the overly affirming nature seems to have held since GPT-3. What makes you think a newer, larger model would not exhibit this behavior? Beyond "they're more capable"? I'd argue that being more capable doesn't mean less sycophantic.

It's certainly possible some of the new advances (chain-of-thought, some kind of agentic architecture) could lessen or remove this effect. But that's not what the paper was studying! And if you feel strongly about it, you could try to further the discussion with results instead of handwavingly dismissing others' work.

senordevnyc 122 days ago

The onus of persuasion is on the persuader, and publishing a study on old models that no one uses anymore isn’t persuasive. I don’t need to prove anything to decide that you haven’t changed my mind.

imtringued 121 days ago

By this logic there can be hundreds of studies that all show the pattern, including a 100% accurate prediction of the results for the next model and none of them would be "persuasive", because OpenAI decided to always release a new model the day before the paper is published.

So what you're saying here is that you were never open to "persuasion" and it was just a front to waste everyone's time.

mkagenius 122 days ago

I think you are absolutely right. (had to)

imtringued 121 days ago

Capabilities are not the same thing as personality.

Upgrading a robot that knows how to lay bricks to one that also knows how to lay plaster won't make it a better therapist.

drfloyd51 123 days ago

It’s as if they are testing “AI” and not specific agents.

I wonder if that is left over from testing people. I have major version numbers and my minor version number changes daily, often as a surprise. Sometimes several times a day. So testing people is a bit tricky. But AIs do have stable version numbers and can be specifically compared.

phyzome 122 days ago

Yeah, these idiots obviously should have been testing models from 1-2 years in the future so that by the time their paper is released, the models are current.

rco8786 123 days ago

If they’re reaching the same results across a variety of the most popular public models, it doesn’t seem like that big a deal to know if it was Opus 4 or Opus 4.5

hn_throwaway_99 123 days ago

Reproducibility is (supposed to be) a cornerstone of science. Model versions are absolutely critical to understand what was actually tested and how to reproduce it.

joaogui1 123 days ago

The models get deprecated after 1-2 years, so reproducibility is pretty hard anyway (but as others pointed out the paper does list the model versions)

jmkni 123 days ago

How many people using AI are actually paying for it (outside of people in tech)?

I find the free models are much more sycophantic and have a higher tendency to hallucinate and just make shit up, and I wonder if these are the ones most people are using?

theshackleford 122 days ago

> I find the free models are much more sycophantic and have a higher tendency to hallucinate and just make shit up

I keep seeing this claim, yet in my experience it doesn't hold water. I pay for the models, most people I know pay for the models, and we see all of the exact same issues.

I have Claude and ChatGPT both bullshit and lick my ass on the regular. The ass licking will occur regardless of instruction.

yawnxyz 122 days ago

Usually the models are a year old bc the paper review process is utter crap, and papers take about a year to get published

Underphil 122 days ago

"Apprehend"?