| HN Mirror



	by entrope 16 days ago
	I do not think it is "lazy". Those labels are ones that human fact-checkers have been using for a decade or more. I think those human fact-checkers use those terms knowing full well that there is overlap and ambiguity between them. So I think this study ends up mixing three effects: how LLMs interpret the claims as statements about the world, how LLMs reduce that to a four-category judgment, and the inherent ambiguities of those labels as natural language. It's a quantification of those three factors combined, but not powerful enough to distinguish their relative sizes.

2 comments

jstummbillig 16 days ago

I don't see how something being lazy for a decade makes it any less lazy. And lazy still seems right to me: they make a misleading point by failing to collect and present important data. If the headline read, say, "LLMs disagree on 67%, humans disagree on 75%", it would clearly project something very different.

Granted, there certainly are other unflattering adjectives one could have chosen to describe this instead.


kostaj 16 days ago

Quick note on the second effect (how LLMs reduce that to a four-category judgment): on 21% of the claims, at least two models give polar-opposite verdicts (at least one model says False and at least one says True). This might be a better measure of strict disagreement than the 67% disagreement on the four-bucket rubric.
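
To make the difference between the two numbers concrete, here is a rough sketch of how both rates could be computed from per-claim model verdicts. Everything in it is an assumption for illustration: the thread only names the True and False labels, so the two middle buckets are placeholders, the function names are hypothetical, and the toy verdicts are made up rather than taken from the study.

```python
from typing import Dict, List

# Hypothetical four-bucket rubric. Only "True" and "False" are named in the
# thread; the two middle labels are placeholders.
LABELS = ("True", "Mostly True", "Mostly False", "False")


def any_disagreement(verdicts: List[str]) -> bool:
    """True if the models did not all pick the same bucket."""
    return len(set(verdicts)) > 1


def polar_disagreement(verdicts: List[str]) -> bool:
    """True if at least one model said True and at least one said False."""
    return "True" in verdicts and "False" in verdicts


def disagreement_rates(claims: Dict[str, List[str]]) -> Dict[str, float]:
    """Fraction of claims with any disagreement and with polar disagreement."""
    if not claims:
        raise ValueError("need at least one claim")
    n = len(claims)
    any_count = sum(any_disagreement(v) for v in claims.values())
    polar_count = sum(polar_disagreement(v) for v in claims.values())
    return {"any": any_count / n, "polar": polar_count / n}


# Made-up verdicts from three imaginary models, NOT data from the study.
toy_claims = {
    "claim-1": ["True", "True", "True"],           # full agreement
    "claim-2": ["True", "Mostly True", "True"],    # adjacent-bucket disagreement
    "claim-3": ["True", "False", "Mostly False"],  # polar disagreement
    "claim-4": ["False", "False", "False"],        # full agreement
}

rates = disagreement_rates(toy_claims)
print(rates)
```

In this toy set, claim-2 counts toward the four-bucket disagreement rate but not toward the polar one, which is the kind of gap that could separate a 67% figure from a 21% figure.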
