by cs702 2243 days ago
Not sure I agree, for two reasons. First, capsule networks have been shown to outperform standard architectures on at least some tasks (see the above papers). Second, and perhaps more importantly, the large and growing chorus of people -- from corporate executives to government regulators -- asking for models that are "explainable" and "interpretable" really couldn't care less about what kinds of models are used. (In my experience, non-technical people with decision-making power are almost always willing to trade performance for better explainability/interpretability/assignment.)

My personal experience with capsule networks is that they didn't work better than a similar number of ungrouped neurons in any case.

If capsules work wonders for you, my first guess would be that you can improve your training of the standard network to make it work equally well.

In general, my hunch is that capsules are still too low-level and too much of a local change to make a strong difference.

To give an example, all of the state-of-the-art optical flow models are based on building cost volumes and then resolving them. There are edge cases where one can prove mathematically that reducing the cost volume to a flow direction will make it impossible to produce the correct result. So to make a significant contribution, it doesn't help to use capsules in the feature-processing stage; you need to replace the entire architecture.
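
For readers who haven't worked with cost volumes, here is a minimal 1-D sketch of "build a cost volume, then resolve it". It is a toy under stated assumptions, not any published model: real systems correlate learned features in 2-D, whereas this uses raw absolute intensity differences, and the names `cost_volume` and `resolve` are made up for illustration. The tie on the textureless pixel is one simple way that collapsing a pixel's costs to a single direction throws information away; it is not necessarily the edge case meant above.

```python
def cost_volume(frame1, frame2, max_disp):
    """cost[x][k] = matching cost of moving pixel x by displacement disps[k]."""
    disps = list(range(-max_disp, max_disp + 1))
    volume = []
    for x, a in enumerate(frame1):
        row = []
        for d in disps:
            y = x + d
            # Candidates that fall outside the frame get infinite cost.
            row.append(abs(a - frame2[y]) if 0 <= y < len(frame2) else float("inf"))
        volume.append(row)
    return volume, disps


def resolve(volume, disps):
    """Reduce each pixel's row of costs to one displacement (first argmin)."""
    return [disps[min(range(len(row)), key=row.__getitem__)] for row in volume]


# Toy frames: a single bright pixel moves one step to the right.
frame1 = [0, 0, 9, 0, 0, 0]
frame2 = [0, 0, 0, 9, 0, 0]
volume, disps = cost_volume(frame1, frame2, max_disp=1)
flow = resolve(volume, disps)
print(flow[2])    # displacement chosen for the bright pixel
print(volume[0])  # a textureless pixel: several candidates tie at zero cost
```

The bright pixel has a unique minimum, so the argmin recovers its motion. The background pixel has several equally good candidates, and the reduction step simply picks one of them.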

Thank you. You may be right. To some extent we're all guessing based on our own hunches :-)

FWIW, I've had the most success with EM/Heinsen-type routing algorithms -- that is, those in which each output capsule is generated by a probabilistic model (such as a Gaussian mixture), and the output capsule activates only to the extent its model can explain (i.e., generate) its view of input data better (in some quantifiable manner) than other output capsules. The notion that an output capsule "must explain input data better than other capsules in order to activate" is very appealing to me as a mechanism for inducing per-layer "explainability" in models.
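
To make "activates only to the extent its model can explain its view of input data better than other output capsules" concrete, here is a rough sketch in plain Python. It is a simplified illustration, not the exact EM or Heinsen routing from the papers: each output capsule is a diagonal Gaussian fitted to the votes it receives, and its activation is just its share of the total responsibility mass (a mixture weight), where the published algorithms use a learned cost passed through a logistic. The names `em_routing`, `votes`, and `in_act` are hypothetical.

```python
import math


def em_routing(votes, in_act, n_iters=3, eps=1e-6):
    """votes[i][j]: input capsule i's vote (list of floats) for output capsule j.
    in_act[i]: activation of input capsule i."""
    n_in, n_out, dim = len(votes), len(votes[0]), len(votes[0][0])
    # Start with every input capsule spread evenly over the output capsules.
    R = [[1.0 / n_out] * n_out for _ in range(n_in)]
    total_act = sum(in_act)
    for _ in range(n_iters):
        # M-step: fit one diagonal Gaussian per output capsule to its weighted votes.
        mu, var, out_act = [], [], []
        for j in range(n_out):
            w = [in_act[i] * R[i][j] for i in range(n_in)]
            mass = sum(w) + eps
            m = [sum(w[i] * votes[i][j][d] for i in range(n_in)) / mass
                 for d in range(dim)]
            v = [sum(w[i] * (votes[i][j][d] - m[d]) ** 2 for i in range(n_in)) / mass + eps
                 for d in range(dim)]
            mu.append(m)
            var.append(v)
            # Simplification: activation = share of the total responsibility mass.
            out_act.append(sum(w) / total_act)
        # E-step: each input is reassigned to the outputs that explain its vote best.
        for i in range(n_in):
            logits = []
            for j in range(n_out):
                log_p = sum(
                    -0.5 * (math.log(2 * math.pi * var[j][d])
                            + (votes[i][j][d] - mu[j][d]) ** 2 / var[j][d])
                    for d in range(dim))
                logits.append(math.log(out_act[j] + eps) + log_p)
            top = max(logits)
            exps = [math.exp(l - top) for l in logits]
            z = sum(exps)
            R[i] = [e / z for e in exps]
    return out_act, mu, R


# Four input capsules, two output capsules, 2-D votes.
# The votes for output 0 agree closely; the votes for output 1 are scattered.
votes = [
    [[1.0, 1.0], [3.0, -2.0]],
    [[1.1, 0.9], [-4.0, 1.0]],
    [[0.9, 1.1], [0.0, 5.0]],
    [[1.0, 1.0], [-2.0, -3.0]],
]
out_act, mu, R = em_routing(votes, in_act=[1.0, 1.0, 1.0, 1.0])
print(out_act)
```

In this toy setup the output capsule whose incoming votes agree ends up with nearly all of the activation, because its tight Gaussian assigns much higher likelihood to the votes than the diffuse one does.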

In my experience so far, routing tends to work better on top of conventional architectures, e.g., use a ResNet for feature detection and stack two or more routing layers on top for classifying into hidden factors and then into training labels. Also, to get models to converge, I have found it helps to apply a nonlinear transformation to the features and then at least two routing layers on top. (I don't have a good explanation as to why two or more tend to work better than only one.) Finally, I usually feed only the capsule activations to the loss function -- that is, during training I let the capsules themselves "do whatever they want" to learn to explain input data.

If you want interpretability you can use a Transformer and look at the attention heads. Or, like in a recent paper, train a language model to give textual justifications for its decision.

Yes. Been there, done that (by "that" I mean looking at attention heads, not generating verbal "justifications" -- the latter is on my want-to-try-it list, even if only out of curiosity) :-)

FYI, Vaswani-style query-key-value self-attention mechanisms can be understood as a type of capsule-routing algorithm -- one in which the capsules are in the form of vector embeddings (each representing a token in a context), the activations are in the form of attention heads (representing which input tokens are most active for each output token), and the number of input and output capsules is the same (for every input token there is an output token).
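
A small sketch of that reading, in plain Python: single-head scaled dot-product attention where the softmax weights are returned as a routing matrix, one row per output token and one column per input token. As a simplifying assumption the token embeddings are used directly as queries, keys, and values (no learned projections), and the function names are made up for illustration.

```python
import math


def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def self_attention(queries, keys, values):
    """Returns (outputs, routing); routing[o][i] is how much input token i
    contributes to output token o."""
    d = len(keys[0])
    outputs, routing = [], []
    for q in queries:
        # Scaled dot-product score of this query against every key.
        scores = [sum(qc * kc for qc, kc in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        routing.append(weights)
        # Output "capsule" = weighted sum of the value vectors.
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
    return outputs, routing


# Three token embeddings, used as queries, keys, and values alike (an assumption).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, routing = self_attention(tokens, tokens, tokens)
print(len(outputs) == len(tokens))  # one output per input token
print(routing[0])                   # which inputs feed the first output
```

Each row of `routing` sums to one, so every output token is a convex combination of the input tokens, and the number of outputs always matches the number of inputs.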

Here, I'm talking more generally about using capsule-routing algorithms in which the capsules can be of any shape (they can be vectors, matrices, or higher-order tensors), the activations can be computed via different proposed mechanisms (including self-attention of course), and the number of input and output capsules need not be the same (e.g., with some algorithms it's possible to have a variable number of input capsules and a fixed number of output capsules).

As I wrote elsewhere on this thread, the routing algorithms I find most interesting are those in which each output capsule is a probabilistic model that "must explain input data better than other output capsules" in order for the capsule to activate.[a]

[a] https://news.ycombinator.com/item?id=23067556

> train a language model to give textual justifications for its decision.

This doesn't work for humans. Sure, they'll give an explanation, but they don't fully understand their own decision-making process, so they can't reliably explain it. I am not sure which paper you're referring to, but how did the researchers address this issue?