Yes. Been there, done that (by "that" I mean looking at attention heads, not generating verbal "justifications" -- the latter is on my want-to-try-it list, even if only out of curiosity) :-)

FYI, Vaswani-style query-key-value self-attention mechanisms can be understood as a type of capsule-routing algorithm -- one in which the capsules are in the form of vector embeddings (each representing a token in a context), the activations are in the form of attention heads (representing which input tokens are most active for each output token), and the number of input and output capsules is the same (for every input token there is an output token).
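A rough sketch of that reading, in plain Python: single-head scaled dot-product self-attention in which each token embedding is treated as a capsule and each row of softmax weights is treated as the routing from all input capsules to one output capsule. The function name `self_attention_routing` and the projection matrices `W_q`, `W_k`, `W_v` are names made up for this illustration, and real implementations use several heads and learned weights.

```python
import math

def matvec(W, x):
    # Multiply matrix W (a list of rows) by vector x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention_routing(capsules, W_q, W_k, W_v):
    # capsules: one vector embedding per token (the input capsules).
    # W_q, W_k, W_v: hypothetical projection matrices for this sketch.
    queries = [matvec(W_q, c) for c in capsules]
    keys = [matvec(W_k, c) for c in capsules]
    values = [matvec(W_v, c) for c in capsules]
    scale = math.sqrt(len(keys[0]))
    outputs, routing = [], []
    for q in queries:  # one output capsule per input token
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)  # how strongly each input routes to this output
        routing.append(weights)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs, routing
```

Here `routing[i][j]` is the weight with which input capsule `j` contributes to output capsule `i`, and the number of outputs always equals the number of inputs.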

Here, I'm talking more generally about using capsule-routing algorithms in which the capsules can be of any shape (they can be vectors, matrices, or higher-order tensors), the activations can be computed via different proposed mechanisms (including self-attention of course), and the number of input and output capsules need not be the same (e.g., with some algorithms it's possible to have a variable number of input capsules and a fixed number of output capsules).
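To make the "variable number of inputs, fixed number of outputs" case concrete, here is a minimal sketch of one possible mechanism (an assumption for illustration, not any specific published routing algorithm): each output capsule owns a query vector and attends over however many input capsules happen to be present. The names `route_to_fixed_outputs` and `output_queries` are hypothetical, and vector capsules are used only to keep the code short.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_to_fixed_outputs(input_capsules, output_queries):
    # input_capsules: any number of vectors, all of the same dimension.
    # output_queries: one (hypothetical, normally learned) query per output capsule.
    scale = math.sqrt(len(output_queries[0]))
    dim = len(input_capsules[0])
    outputs = []
    for q in output_queries:
        scores = [sum(a * b for a, b in zip(q, c)) / scale for c in input_capsules]
        weights = softmax(scores)  # routing weights over the inputs present
        outputs.append([
            sum(w * c[d] for w, c in zip(weights, input_capsules))
            for d in range(dim)
        ])
    return outputs
```

The number of output capsules is fixed by `len(output_queries)`, regardless of how many input capsules are passed in.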

As I wrote elsewhere on this thread, the routing algorithms I find most interesting are those in which each output capsule is a probabilistic model that "must explain input data better than other output capsules" in order for the capsule to activate.[a]
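A toy illustration of that "must explain the data better" idea, under strong simplifying assumptions: each output capsule is an isotropic Gaussian with a shared, fixed variance, and routing is a few EM-style steps in which output capsules compete for responsibility over each input. This is a generic mixture-model sketch, not the specific algorithm from the linked comment; `route_by_explanation`, `var`, and `iterations` are names invented here.

```python
import math

def gaussian_log_density(x, mean, var):
    # Log density of an isotropic Gaussian N(mean, var * I) at point x.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, mean))
    return -0.5 * (len(x) * math.log(2 * math.pi * var) + sq_dist / var)

def route_by_explanation(inputs, means, var=1.0, iterations=3):
    # inputs: input capsules (vectors); means: initial means of the output capsules.
    means = [list(m) for m in means]
    n_out, dim = len(means), len(inputs[0])
    for _ in range(iterations):
        # E-step: each input splits its vote among the output capsules in
        # proportion to how well each capsule's model explains it.
        resp = []
        for x in inputs:
            logp = [gaussian_log_density(x, m, var) for m in means]
            top = max(logp)
            w = [math.exp(lp - top) for lp in logp]
            total = sum(w)
            resp.append([wi / total for wi in w])
        # M-step: each output capsule refits its mean to the inputs it won.
        for j in range(n_out):
            mass = sum(r[j] for r in resp)
            if mass > 1e-12:
                means[j] = [
                    sum(r[j] * x[d] for r, x in zip(resp, inputs)) / mass
                    for d in range(dim)
                ]
    # Activation: share of total responsibility captured in the last E-step.
    activations = [sum(r[j] for r in resp) / len(inputs) for j in range(n_out)]
    return means, activations, resp
```

An output capsule ends up with a high activation only if it accounts for the inputs better than its competitors do, since responsibilities are normalized across output capsules rather than across inputs.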

[a] https://news.ycombinator.com/item?id=23067556