Hacker News
by visarga 2242 days ago
If you want interpretability, you can use a Transformer and look at the attention heads. Or, as in a recent paper, train a language model to give textual justifications for its decisions.

Yes. Been there, done that (by "that" I mean looking at attention heads, not generating verbal "justifications" -- the latter is on my want-to-try-it list, even if only out of curiosity) :-)

FYI, Vaswani-style query-key-value self-attention mechanisms can be understood as a type of capsule-routing algorithm -- one in which the capsules are in the form of vector embeddings (each representing a token in a context), the activations are in the form of the attention weights computed by each head (representing which input tokens are most active for each output token), and the number of input and output capsules is the same (for every input token there is an output token).
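To make the analogy concrete, here is a minimal single-head sketch in plain Python. It's my own illustration, not code from any paper: random matrices stand in for learned projections, and the name `self_attention_routing` is hypothetical. Each token embedding is an input capsule, each row of `weights` plays the role of the routing coefficients for one output capsule, and there are exactly as many output capsules as input capsules:

```python
import math
import random

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention_routing(X, Wq, Wk, Wv):
    """One scaled dot-product attention head, read as capsule routing.

    X: n input capsules (token embeddings). Returns (outputs, weights):
    n output capsules and the n x n routing weights, where row i says
    how much each input capsule contributes to output capsule i.
    """
    Q = [matvec(Wq, x) for x in X]
    K = [matvec(Wk, x) for x in X]
    V = [matvec(Wv, x) for x in X]  # the "votes" each input capsule casts
    d_k = len(K[0])
    weights, outputs = [], []
    for q in Q:
        # Routing coefficients for this output capsule: softmax over inputs.
        w = softmax([dot(q, k) / math.sqrt(d_k) for k in K])
        weights.append(w)
        # Output capsule = routing-weighted sum of the input votes.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights

random.seed(0)
n, d = 3, 4

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

X = rand_matrix(n, d)
Wq, Wk, Wv = rand_matrix(d, d), rand_matrix(d, d), rand_matrix(d, d)
outputs, weights = self_attention_routing(X, Wq, Wk, Wv)
print(len(outputs), len(weights))  # 3 3: one output capsule per input capsule
```

One detail worth noticing: here the softmax runs over inputs for each output, whereas in Sabour et al.'s dynamic routing it runs over outputs for each input. That's one of the knobs the broader family of routing algorithms lets you turn.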

Here, I'm talking more generally about using capsule-routing algorithms in which the capsules can be of any shape (they can be vectors, matrices, or higher-order tensors), the activations can be computed via different proposed mechanisms (including self-attention of course), and the number of input and output capsules need not be the same (e.g., with some algorithms it's possible to have a variable number of input capsules and a fixed number of output capsules).
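For contrast, a rough sketch of one such algorithm, dynamic routing by agreement (Sabour et al., 2017), adapted so the number of input capsules can vary while the number of output capsules stays fixed. The adaptation is an assumption made for illustration: instead of a separate transformation matrix for every input/output pair, each output capsule has one matrix `W_out[j]` shared across all inputs. The function names are hypothetical:

```python
import math
import random

def squash(s):
    """Sabour et al.'s nonlinearity: keeps direction, maps length into [0, 1)."""
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0] * len(s)
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in s]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(inputs, W_out, n_iters=3):
    """Dynamic routing by agreement with a fixed number of output capsules.

    inputs: any number (>= 1) of input capsule vectors of length d_in.
    W_out: one d_out x d_in matrix per output capsule, shared by all inputs
           (hypothetical choice that lets the number of inputs vary).
    """
    n_in, n_out, d_out = len(inputs), len(W_out), len(W_out[0])
    # votes[i][j]: input capsule i's prediction for output capsule j
    votes = [[[sum(w * x for w, x in zip(row, u)) for row in W] for W in W_out]
             for u in inputs]
    logits = [[0.0] * n_out for _ in range(n_in)]
    outputs = []
    for _ in range(n_iters):
        # Each input splits its vote among the outputs (softmax over outputs).
        coupling = [softmax(row) for row in logits]
        s = [[sum(coupling[i][j] * votes[i][j][c] for i in range(n_in))
              for c in range(d_out)] for j in range(n_out)]
        outputs = [squash(sj) for sj in s]
        # Strengthen routes whose votes agree with the resulting output.
        for i in range(n_in):
            for j in range(n_out):
                logits[i][j] += sum(a * b for a, b in zip(votes[i][j], outputs[j]))
    return outputs

random.seed(1)
d_in, d_out, n_out = 3, 4, 2
W_out = [[[random.gauss(0, 0.5) for _ in range(d_in)] for _ in range(d_out)]
         for _ in range(n_out)]
for n_in in (5, 9):  # different numbers of input capsules...
    inputs = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(n_in)]
    caps = route(inputs, W_out)
    lengths = [math.sqrt(sum(x * x for x in v)) for v in caps]
    print(n_in, len(caps), [round(l, 3) for l in lengths])  # len(caps) is 2 both times
```

The length of each output vector acts as that capsule's activation, which is why `squash` keeps it below 1.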

As I wrote elsewhere on this thread, the routing algorithms I find most interesting are those in which each output capsule is a probabilistic model that "must explain input data better than other output capsules" in order for the capsule to activate.[a]

[a] https://news.ycombinator.com/item?id=23067556
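A rough illustration of the "must explain input data better than other output capsules" idea, using a plain Gaussian-mixture EM loop as a stand-in. This is loosely in the spirit of EM routing (Hinton, Sabour & Frosst, 2018), but it is not that algorithm: there are no learned activation costs and no pose transforms, and `em_route` and its parameters are hypothetical. Responsibilities are normalized across output capsules, so the capsules compete for each input, and a capsule's activation is the share of the data it ends up explaining:

```python
import math

def log_gauss(x, mu, var):
    """Log-density of x under a diagonal Gaussian with means mu, variances var."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mu, var))

def em_route(inputs, init_means, n_iters=5, min_var=1e-3):
    """Simplified EM-style routing: each output capsule is a diagonal Gaussian.

    Returns (means, activations); activations[j] is the fraction of the
    inputs that output capsule j explains better than its competitors.
    """
    n_in, n_out, d = len(inputs), len(init_means), len(inputs[0])
    means = [list(m) for m in init_means]
    variances = [[1.0] * d for _ in range(n_out)]
    activations = [1.0 / n_out] * n_out
    for _ in range(n_iters):
        # E-step: normalize across outputs, so capsules compete for each input.
        resp = []
        for x in inputs:
            logs = [math.log(activations[j]) + log_gauss(x, means[j], variances[j])
                    for j in range(n_out)]
            top = max(logs)
            ws = [math.exp(lg - top) for lg in logs]
            total = sum(ws)
            resp.append([w / total for w in ws])
        # M-step: each capsule refits itself to the inputs it claimed.
        for j in range(n_out):
            r_j = [r[j] for r in resp]
            n_j = sum(r_j)
            activations[j] = max(n_j / n_in, 1e-12)  # floor keeps log() finite
            if n_j < 1e-6:
                continue  # capsule explains (almost) nothing: leave it as is
            means[j] = [sum(r * x[c] for r, x in zip(r_j, inputs)) / n_j
                        for c in range(d)]
            variances[j] = [max(min_var, sum(r * (x[c] - means[j][c]) ** 2
                                             for r, x in zip(r_j, inputs)) / n_j)
                            for c in range(d)]
    return means, activations

# Six points near the origin, two near (5, 5); two competing output capsules.
points = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (-0.2, 0.0), (0.0, -0.2), (0.1, 0.1),
          (5.0, 5.0), (5.2, 5.0)]
means, activations = em_route(points, init_means=[(1.0, 1.0), (4.0, 4.0)])
print([round(a, 2) for a in activations])  # [0.75, 0.25]
```

The first capsule ends up claiming the six points near the origin and the second the two near (5, 5), so their activations track how much of the input each one explains.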

> train a language model to give textual justifications for its decision.

This doesn't work for humans. Sure, they'll give an explanation, but they don't fully understand their own decision-making process, so they can't reliably explain it. I am not sure which paper you're referring to, but how did the researchers address this issue?