Hacker News
by JoshuaDavid 2242 days ago
> train a language model to give textual justifications for its decision.

This doesn't work for humans. Sure, they'll give an explanation, but they don't fully understand their own decision-making process, so they can't reliably explain it. I'm not sure which paper you're referring to, but how did the researchers address this issue?