Thank you. You may be right. To some extent we're all guessing based on our own hunches :-)

FWIW, I've had the most success with EM/Heinsen-type routing algorithms -- that is, those in which each output capsule is generated by a probabilistic model (such as a Gaussian mixture), and the output capsule activates only to the extent its model can explain (i.e., generate) its view of input data better (in some quantifiable manner) than other output capsules. The notion that an output capsule "must explain input data better than other capsules in order to activate" is very appealing to me as a mechanism for inducing per-layer "explainability" in models.
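To make the "must explain input data better than other capsules" idea concrete, here is a minimal sketch in plain Python. It is a toy, not the published EM or Heinsen routing algorithms: it assumes each output capsule is a single diagonal Gaussian over the votes it receives, and it defines a capsule's activation as the share of the input it ends up explaining. The names `route`, `votes`, `min_var`, and `eps` are made up for illustration.

```python
import math

def gaussian_log_pdf(x, mean, var):
    """Log density of a diagonal Gaussian evaluated at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def route(votes, n_iters=3, min_var=1e-2, eps=1e-6):
    """Toy EM-style routing (an illustrative assumption, not a reference method).

    votes[i][j] is the vector that input capsule i proposes for output
    capsule j. Returns (activations, responsibilities).
    """
    n_in, n_out, dim = len(votes), len(votes[0]), len(votes[0][0])
    # Start by sharing every input equally among the output capsules.
    resp = [[1.0 / n_out] * n_out for _ in range(n_in)]

    for _ in range(n_iters):
        # M-step: fit each output capsule's Gaussian to the votes assigned to it.
        models = []
        for j in range(n_out):
            mass = sum(resp[i][j] for i in range(n_in))
            mean = [
                sum(resp[i][j] * votes[i][j][d] for i in range(n_in)) / (mass + eps)
                for d in range(dim)
            ]
            var = [
                sum(resp[i][j] * (votes[i][j][d] - mean[d]) ** 2 for i in range(n_in))
                / (mass + eps) + min_var  # variance floor avoids degenerate fits
                for d in range(dim)
            ]
            models.append((mass / n_in, mean, var))

        # E-step: reassign each input to the capsules that explain its vote best.
        for i in range(n_in):
            scores = [
                math.log(prior + eps) + gaussian_log_pdf(votes[i][j], mean, var)
                for j, (prior, mean, var) in enumerate(models)
            ]
            top = max(scores)  # subtract the max for a numerically stable softmax
            weights = [math.exp(s - top) for s in scores]
            total = sum(weights)
            resp[i] = [w / total for w in weights]

    # Activation: the share of the input each output capsule ends up explaining.
    activations = [sum(resp[i][j] for i in range(n_in)) / n_in for j in range(n_out)]
    return activations, resp

# Four input capsules, two output capsules, 2-D votes (made-up numbers).
# The inputs agree closely on output 0 and disagree wildly on output 1.
votes = [
    [[1.0, 1.0], [5.0, -5.0]],
    [[1.1, 0.9], [-4.0, 3.0]],
    [[0.9, 1.1], [0.0, 8.0]],
    [[1.0, 1.0], [-6.0, -2.0]],
]
activations, _ = route(votes)
print(activations)  # output capsule 0 should take nearly all of the activation
```

Because the responsibilities are normalized across output capsules, one capsule can only gain activation at the expense of the others, which is the competitive part of the idea. Real routing layers also learn the vote transformations and usually treat activations differently; none of that is modeled here.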

In my experience so far, routing tends to work better on top of conventional architectures, e.g., use a ResNet for feature detection and stack two or more routing layers on top for classifying into hidden factors and then into training labels. Also, to get models to converge, I have found it helps to apply a nonlinear transformation to the features before passing them to the routing layers, and to use at least two of those layers. (I don't have a good explanation as to why two or more tend to work better than only one.) Finally, I usually feed only the capsule activations to the loss function -- that is, during training I let the capsules themselves "do whatever they want" to learn to explain input data.
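The last point, feeding only the activations to the loss, can be sketched roughly like this. The choice of a normalized negative log-likelihood is an assumption for illustration (the comment does not say which loss is used), and `activation_only_loss` and `poses` are hypothetical names.

```python
import math

def activation_only_loss(activations, poses, label, eps=1e-9):
    """Negative log-likelihood of the true label, computed from activations only.

    `poses` (the capsules' internal vectors) is accepted but deliberately
    unused, so nothing in the loss constrains what the capsules encode.
    """
    total = sum(activations) + eps
    return -math.log(activations[label] / total + eps)

# Same activations, different poses: the loss does not change.
acts = [0.9, 0.1]
loss_a = activation_only_loss(acts, poses=[[0.0, 1.0], [2.0, 3.0]], label=0)
loss_b = activation_only_loss(acts, poses=[[9.0, 9.0], [9.0, 9.0]], label=0)
print(loss_a == loss_b)
```

In a real model the poses still receive gradients indirectly, because they determine the votes and therefore the activations of the next layer; they are just never compared against a target directly.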