|
|
|
|
|
by derac
14 days ago
|
|
I think Honesty can be evaluated. Does the model push back when it knows the user is wrong? How often does the model hallucinate data vs. say it doesn't know? Provide a prompt with contradictions or other issues and see if the model corrects you. Here is an article by Anthropic that explains what they do and mean in more detail:
https://alignment.anthropic.com/2025/honesty-elicitation/ |
|