by dinfinity 393 days ago

Imagine the irony if this article were to exaggerate the claims made in the study itself.

Perhaps saying things like "Most leading chatbots routinely exaggerate science findings" instead of "We tested 10 prominent LLMs, including ChatGPT-4o, ChatGPT-4.5, DeepSeek, LLaMA 3.3 70B, and Claude 3.7 Sonnet [...] with DeepSeek, ChatGPT-4o, and LLaMA 3.3 70B overgeneralizing in 26–73% of cases".

To be fair, the article itself already mentions this: "Summaries by models (1), (4), (8), and (9) didn’t significantly differ in the kind of generalisations they contained from the original text. “So basically,” Peters concludes, “Claude, in different versions, did really well.”"