by throw310822
17 days ago
Not sure I understand this. The models are asked to evaluate the truth of random claims purely from their own knowledge (except for Gemini with search grounding)? Isn't that exactly the same as asking people to play a quiz game and then reporting that "they disagree n% of the time"? The output buckets are also pretty questionable: the difference between "True" and "Mostly true" is pretty fuzzy. Is that marked as a "disagreement"?