| HN Mirror


by cs702 2243 days ago

I'm surprised there is no mention of capsules and capsule-routing algorithms.

Capsules are groups of neurons that represent discrete entities in different contexts. For example, a 4x4 pose matrix is a capsule representing a particular object in different orientations seen from different viewpoints. Similarly, a subword embedding can be seen as a capsule with vector shape representing a particular subword in different natural language contexts. More generally, a capsule can have any shape, but it always represents only one entity in some context.

In certain new capsule-routing algorithms -- e.g., EM routing[a], Heinsen routing[b], dynamic routing[c], to name a few off the top of my head[d] -- each capsule can activate or not depending on whether the entity it represents is detected or not in the context of input data.

Models using these algorithms therefore make it possible for human beings to interpret model behavior in terms of capsule activations -- e.g., "the final layer predicts label 2 because capsules 7, 23, and 41 activated the most in the last hidden routing layer."
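To make the "capsules 7, 23, and 41 activated the most" reading concrete, here is a minimal sketch of how one might pull the most active capsules out of a layer. The function name `top_capsules` and the activation values are made up for illustration; they don't come from any of the papers linked below.

```python
import numpy as np

def top_capsules(activations, k=3):
    """Return (index, activation) pairs for the k most active capsules, highest first."""
    activations = np.asarray(activations, dtype=float)
    order = np.argsort(activations)[::-1][:k]  # indices sorted by descending activation
    return [(int(i), float(activations[i])) for i in order]

# Hypothetical activations for 50 capsules in the last hidden routing layer.
acts = np.full(50, 0.05)
acts[[7, 23, 41]] = [0.91, 0.84, 0.77]

print(top_capsules(acts))  # [(7, 0.91), (23, 0.84), (41, 0.77)]
```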

While these new routing algorithms are not yet widely used, in my humble opinion they present a promising avenue of research for building models that are explainable and/or enable assignment of causality at high levels of function composition.

[a] https://research.google/pubs/pub46653/

[b] https://arxiv.org/abs/1911.00792

[c] https://arxiv.org/abs/1710.09829

[d] If you're aware of other routing algorithms that can similarly activate/deactivate capsules, please post a link to the paper or code here.

2 comments

Eridrus 2243 days ago

It should be pretty obvious why: they don't work as well as what is standard. This is about ways to explain the models we get good performance with.


cs702 2243 days ago

Not sure I agree, for two reasons. First, capsule networks have been shown to outperform standard architectures in at least some tasks (see the above papers). Second, and perhaps more importantly, the large and growing chorus of people -- from corporate executives to government regulators -- asking for models that are "explainable" and "interpretable" really couldn't care less as to what kinds of models are used. (In my experience, non-technical people with decision-making power are almost always willing to trade performance for better explainability/interpretability/assignment.)


fxtentacle 2243 days ago

My personal experience with capsule networks is that they didn't work better than a similar number of ungrouped neurons in any case.

If capsules work wonders for you, my first guess would be that you can improve your training of the standard network to make it work equally well.

In general, my hunch is that capsules are still too low level and too much of a local change to make a strong difference.

To give an example, all of the state-of-the-art optical flow models are based on building cost volumes and then resolving them. There are edge cases where one can prove mathematically that reducing the cost volume to a flow direction makes it impossible to produce the correct result. So to make a significant contribution, it doesn't help to use capsules in the feature processing stage; you need to replace the entire architecture.


cs702 2242 days ago

Thank you. You may be right. To some extent we're all guessing based on our own hunches :-)

FWIW, I've had the most success with EM/Heinsen-type routing algorithms -- that is, those in which each output capsule is generated by a probabilistic model (such as a Gaussian mixture), and the output capsule activates only to the extent its model can explain (i.e., generate) its view of input data better (in some quantifiable manner) than other output capsules. The notion that an output capsule "must explain input data better than other capsules in order to activate" is very appealing to me as a mechanism for inducing per-layer "explainability" in models.
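As a rough illustration of the "must explain input data better than other capsules in order to activate" idea, here is a heavily simplified sketch. It is plain EM for a diagonal Gaussian mixture, not the EM routing or Heinsen routing algorithm from the papers: each output capsule is one Gaussian component, and its "activation" is simply the share of input vectors it ends up explaining. The function name `gaussian_routing` and every parameter are assumptions made for this example.

```python
import numpy as np

def gaussian_routing(x, n_out, n_iters=10, seed=0):
    """Toy competition between Gaussian output capsules.

    x: (n_in, d) array of input vectors.
    Returns activations (n_out,), means (n_out, d), responsibilities (n_in, n_out).
    """
    rng = np.random.default_rng(seed)
    n_in, d = x.shape
    mu = x[rng.choice(n_in, size=n_out, replace=False)]  # start means at data points
    var = np.ones((n_out, d))
    act = np.full(n_out, 1.0 / n_out)
    for _ in range(n_iters):
        # E-step: how well does each output capsule explain each input?
        diff = x[:, None, :] - mu[None, :, :]                       # (n_in, n_out, d)
        log_p = -0.5 * np.sum(diff**2 / var + np.log(2 * np.pi * var), axis=-1)
        log_r = np.log(act) + log_p
        log_r -= log_r.max(axis=1, keepdims=True)                   # numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)                           # capsules compete per input
        # M-step: refit each capsule's Gaussian to the inputs it explains.
        mass = r.sum(axis=0) + 1e-9
        mu = (r.T @ x) / mass[:, None]
        diff = x[:, None, :] - mu[None, :, :]
        var = np.einsum('ij,ijd->jd', r, diff**2) / mass[:, None] + 1e-3
        act = mass / mass.sum()                                     # share of data explained
    return act, mu, r

# Hypothetical data: 30 points near (0, 0) and 10 points near (5, 5).
data_rng = np.random.default_rng(1)
x = np.concatenate([data_rng.normal(0.0, 0.5, (30, 2)),
                    data_rng.normal(5.0, 0.5, (10, 2))])
act, mu, r = gaussian_routing(x, n_out=2)
print(act)
```

The real algorithms add learned vote transformations and a learned activation function on top of this kind of competition; the sketch only shows the competition itself.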

In my experience so far, routing tends to work better on top of conventional architectures, e.g., use a ResNet for feature detection and stack two or more routing layers on top for classifying into hidden factors and then into training labels. Also, to get models to converge, I have found it helps to apply a nonlinear transformation to the features and then at least two routing layers on top. (I don't have a good explanation as to why two or more tend to work better than only one.) Finally, I usually feed only the capsule activations to the loss function -- that is, during training I let the capsules themselves "do whatever they want" to learn to explain input data.
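A minimal sketch of the "feed only the capsule activations to the loss function" idea, assuming the final routing layer returns one activation and one pose per label. The function name `activation_only_loss` is hypothetical, and normalizing the activations by their sum before taking the cross-entropy is just one possible choice.

```python
import numpy as np

def activation_only_loss(activations, poses, label, eps=1e-9):
    """Cross-entropy computed from output-capsule activations alone.

    The poses are accepted but deliberately unused, so the loss places
    no direct constraint on what the capsules themselves contain.
    """
    del poses  # intentionally ignored
    activations = np.asarray(activations, dtype=float)
    probs = activations / (activations.sum() + eps)  # assumed normalization
    return float(-np.log(probs[label] + eps))

# Hypothetical output of a final routing layer with 3 label capsules.
acts = np.array([0.1, 0.8, 0.1])
poses = np.zeros((3, 4, 4))  # e.g. one 4x4 pose matrix per label capsule

print(activation_only_loss(acts, poses, label=1))
print(activation_only_loss(acts, poses, label=0))
```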


visarga 2242 days ago

If you want interpretability you can use a Transformer and look at the attention heads. Or, like in a recent paper, train a language model to give textual justifications for its decision.


cs702 2242 days ago

Yes. Been there, done that (by "that" I mean looking at attention heads, not generating verbal "justifications" -- the latter is on my want-to-try-it list, even if only out of curiosity) :-)

FYI, Vaswani-style query-key-value self-attention mechanisms can be understood as a type of capsule-routing algorithm -- one in which the capsules are in the form of vector embeddings (each representing a token in a context), the activations are in the form of attention heads (representing which input tokens are most active for each output token), and the number of input and output capsules is the same (for every input token there is an output token).
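To illustrate that reading, here is a minimal single-head sketch of scaled dot-product self-attention in NumPy. The weight matrices are random stand-ins, and the "capsule" and "routing" vocabulary in the comments is the interpretation described above, not terminology from the Transformer paper.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (n, d) input capsules, one vector embedding per token.
    Returns output capsules (n, d_v) and routing weights (n, n).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over input tokens
    return weights @ v, weights

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 5, 16, 8
x = rng.normal(size=(n_tokens, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)  # (5, 8) (5, 5)
```

Row `i` of `weights` says how much each input token contributes to output token `i`, and there is exactly one output vector per input vector.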

Here, I'm talking more generally about using capsule-routing algorithms in which the capsules can be of any shape (they can be vectors, matrices, or higher-order tensors), the activations can be computed via different proposed mechanisms (including self-attention of course), and the number of input and output capsules need not be the same (e.g., with some algorithms it's possible to have a variable number of input capsules and a fixed number of output capsules).
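As one example of a variable number of input capsules mapping to a fixed number of output capsules, here is a sketch of routing-by-agreement in the style of the dynamic routing paper linked earlier in the thread. One assumption to flag: the vote transformation `w` is shared across all input capsules so that the input count can vary, whereas the paper's fully connected capsule layer uses a separate matrix per input/output pair.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Shrink each vector's length into [0, 1) while keeping its direction."""
    sq = np.sum(s**2, axis=-1, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def dynamic_routing(u, w, n_iters=3):
    """u: (n_in, d_in) input capsules; w: (n_out, d_in, d_out) shared vote weights.

    Returns output capsules of shape (n_out, d_out); each vector's length
    acts as that capsule's activation.
    """
    u_hat = np.einsum('id,jde->ije', u, w)            # votes: (n_in, n_out, d_out)
    b = np.zeros(u_hat.shape[:2])                     # routing logits
    for _ in range(n_iters):
        c = np.exp(b - b.max(axis=1, keepdims=True))
        c /= c.sum(axis=1, keepdims=True)             # softmax over output capsules
        s = np.einsum('ij,ije->je', c, u_hat)         # weighted sum of votes
        v = squash(s)
        b = b + np.einsum('ije,je->ij', u_hat, v)     # reward votes that agree with v
    return v

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(4, 6, 8))             # 4 output capsules of size 8

for n_in in (5, 17):                                  # different numbers of input capsules
    v = dynamic_routing(rng.normal(size=(n_in, 6)), w)
    print(n_in, v.shape, np.linalg.norm(v, axis=-1).round(3))
```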

As I wrote elsewhere on this thread, the routing algorithms I find most interesting are those in which each output capsule is a probabilistic model that "must explain input data better than other output capsules" in order for the capsule to activate.[a]

[a] https://news.ycombinator.com/item?id=23067556


JoshuaDavid 2242 days ago

> train a language model to give textual justifications for its decision.

This doesn't work for humans. Sure, they'll give an explanation, but they don't fully understand their own decision making process so they can't reliably explain it. I am not sure which paper you're referring to, but how did the researchers address this issue?


thomashobohm 2243 days ago

I think it should be obvious why there's no mention -- in fact, you said it yourself -- "these new routing algorithms are not yet widely used" but are a "promising avenue of research." The purpose of the paper as stated is to help people explain how commonly used deep learning tools work to laypeople, and including an aside about some niche subfield of deep learning research doesn't align with that goal (regardless of how interesting you personally think it is).
