by rekenaut 381 days ago
I wonder if there is any connection between the models producing exaggerated outputs and the litany of exaggerated or overconfident claims that academic media offices and the press have produced from previous studies. Maybe models trained on both the studies and the reports about them naturally tend toward the style of attention-seeking reports, even when directly provided with the studies.

This is the same mistake we were seeing in commercial use of AI.

   1. "This process is flawed due to human bias"
   2. Train AI/ML to make the same decisions with the same outcome
   3. "How can there be any flaws in this process? AI is bias-free."
A repost of a previous comment, but the anthropomorphization of this tech is so off the charts, I feel like I'm going to be repeating it, a lot:

One of the most offensive words in the anthropomorphization of LLMs is: hallucinate.

It's not only an anthropomorphism, it's also a euphemism.

A correct interpretation of the word would imply that the LLM has some fantastical vision that it mistakes for reality. What utter bullsh1t.

Let's just use the correct word for this type of output: wrong.

When the LLM generates a sequence of words that may or may not be grammatically correct, but that implies a state or conclusion that is not factually correct, let's state what actually happened: the LLM-generated text was WRONG.

It didn't take a trip down Alice's rabbit hole; it just put words together into a stream that implied a piece of information that was incorrect. It was just WRONG.

The euphemistic aspect of using this word is a greater offense than the anthropomorphism, because it paints some cutesy picture of what happened instead of accurately acknowledging that the software generated an incorrect result. It's covering up for the inherent shortcomings of the tech.

Imagine the irony if this article were to exaggerate the claims made in the study itself.

Perhaps saying things like "Most leading chatbots routinely exaggerate science findings" instead of "We tested 10 prominent LLMs, including ChatGPT-4o, ChatGPT-4.5, DeepSeek, LLaMA 3.3 70B, and Claude 3.7 Sonnet [...] with DeepSeek, ChatGPT-4o, and LLaMA 3.3 70B overgeneralizing in 26–73% of cases".

To be fair, the article itself already mentions this: "Summaries by models (1), (4), (8), and (9) didn’t significantly differ in the kind of generalisations they contained from the original text. “So basically,” Peters concludes, “Claude, in different versions, did really well.”"