HN Mirror



	by throw310822 17 days ago
	Not sure I understand this. The models are asked to evaluate the truth of random claims out of their own heads (except for Gemini with search grounding)? Isn't that exactly the same as asking people to play a quiz game and then reporting that "they disagree n% of the time"? The output buckets are also pretty questionable: the difference between "True" and "Mostly true" is pretty fuzzy. Does that count as a "disagreement"?

1 comment

kostaj 16 days ago

Agreed that True and Mostly True might be very close, and the gap could just be a calibration difference. The same goes for Misleading and False. A better headline number might be the 34% of claims with substantial or polar-opposite verdicts.
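
To make the distinction concrete, here is a rough sketch of how "any disagreement" and "substantial disagreement" could be counted separately. It assumes the verdicts sit on a four-step ordinal scale (True, Mostly true, Misleading, False); only those four labels come up in this thread, so the ordering, the `threshold` of two steps, the function names, and the sample verdicts are all made up for illustration and are not the study's actual method or data.

```python
# Assumed ordinal scale: adjacent labels are treated as near-agreement.
SCALE = {"True": 0, "Mostly true": 1, "Misleading": 2, "False": 3}


def max_gap(verdicts):
    """Largest distance on the scale between any two verdicts for one claim."""
    ranks = [SCALE[v] for v in verdicts]
    return max(ranks) - min(ranks)


def disagreement_rates(claims, threshold=2):
    """Return (share with any difference, share with a gap >= threshold)."""
    if not claims:
        raise ValueError("need at least one claim")
    any_diff = sum(1 for v in claims if max_gap(v) >= 1)
    substantial = sum(1 for v in claims if max_gap(v) >= threshold)
    return any_diff / len(claims), substantial / len(claims)


# Hypothetical verdicts from three models on four claims.
claims = [
    ["True", "Mostly true", "True"],      # fuzzy, adjacent buckets
    ["Misleading", "False", "False"],     # fuzzy, adjacent buckets
    ["True", "False", "Mostly true"],     # polar-opposite verdicts
    ["True", "True", "True"],             # full agreement
]

any_rate, substantial_rate = disagreement_rates(claims)
print(any_rate, substantial_rate)
```

On this toy data the two numbers diverge a lot, because two of the three "disagreements" are only one bucket apart, which is the calibration issue raised above.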
