| HN Mirror


by dns_snek 89 days ago

And how is this comment relevant here? The abstract lists the digestible model names, and you can find the details in the supplementary text:

> To evaluate user-facing production LLMs, we studied four proprietary models: OpenAI’s GPT-5 and GPT-4o (80), Google’s Gemini-1.5-Flash (81) and Anthropic’s Claude Sonnet 3.7 (82); and seven open-weight models: Meta’s Llama-3-8B-Instruct, Llama-4-Scout-17B-16E, and Llama-3.3-70B-Instruct-Turbo (83, 84); Mistral AI’s Mistral-7B-Instruct-v0.3 (85) and Mistral-Small-24B-Instruct-2501 (86); DeepSeek-V3 (87); and Qwen2.5-7B-Instruct-Turbo (88).

edit: It looks like OP attached the wrong link to the paper!

The article is about this Stanford study: https://www.science.org/doi/10.1126/science.aec8352

But the link in OP's post points to (what seems to be) a completely unrelated study.

2 comments

vorticalbox 89 days ago

"OpenAI’s GPT-5" is ambiguous. Does that mean GPT-5, 5.1, 5.2, 5.3, or 5.4? Does it include the full model, or the nano/mini variants?


dns_snek 89 days ago

GPT-5 is not ambiguous; it's the official name of the model that was released in August last year.

> All evaluations were done in March - August 2025.


vorticalbox 89 days ago

While true, all the others got precise identifiers. For OpenAI, the bare name makes the work hard to reproduce because I have no idea "which" GPT-5 was used.


gardenerik 87 days ago

It was called just GPT-5 at that point in time.


prjkt 88 days ago

In that case, what tokenizer version? What was the temperature set to? Top-k? Top-p? FP32? FP16? Quantized? Hopper? Blackwell?


zjp 89 days ago

Also, nothing has changed! Claude will still yes-and whatever you give it. ChatGPT still has its insufferable personality, where it takes what you said and hands it back to you in different terms as if it's ChatGPT's insight.


Terretta 88 days ago

OTOH, for Claude the study says 39% yessy, same as humans, 2nd lowest yessing of the LLMs; GPT5 above 50% yessy.


emp17344 89 days ago

No dude, you don’t understand! It’s just so advanced now that you aren’t allowed to levy any criticism whatsoever!


TrainedMonkey 89 days ago

It's almost like it is based on the training data and regimen that is largely the same between versions.


dryarzeg 89 days ago

Well yes, but no. There are also open-weight models, and literally all of those listed above are no longer used, at least by most end users and developers, as far as I'm aware.


edgyquant 88 days ago

No study of AI can ever be done or be relevant because every couple of months they add a new number to the name of the model, thus invalidating all work around model behavior.


dryarzeg 86 days ago

Yes, you are right. Sorry, I missed that. It's just that all the open-weight models mentioned were... one year old or older. I just forgot that, firstly, such research is rarely done on frontier models because it takes time (you start with Llama 3.3, but look, one month later there's Llama 4), and secondly, there's also a publication delay. I think I'm just too used to the world of software, where everything moves at lightning speed. Sorry : )
