Hacker News new | ask | show | jobs
by boxy310 3136 days ago
The major advantage proposed for capsule networks is the ability to train off far fewer number of observations, not necessarily the full accuracy. At this point CNN's are consistently approaching or even exceeding human levels of accuracy, and thus be benefiting from a slower but more accurate methodology that relies on far more training data.
1 comments

Skimming the two papers I could not find any figure about data efficiency. Did you?
The specific instance I was remembering was from interviews Hinton's given about these papers, but this is the section of the arXiv paper that's relevant:

>Now that convolutional neural networks have become the dominant approach to object recognition, it makes sense to ask whether there are any exponential inefficiencies that may lead to their demise. A good candidate is the difficulty that convolutional nets have in generalizing to novel viewpoints. The ability to deal with translation is built in, but for the other dimensions of an affine transformation we have to chose between replicating feature detectors on a grid that grows exponentially with the number of dimensions, or increasing the size of the labelled training set in a similarly exponential way. Capsules (Hinton et al. [2011]) avoid these exponential inefficiencies by converting pixel intensities into vectors of instantiation parameters of recognized fragments and then applying transformation matrices to the fragments to predict the instantiation parameters of larger fragments. Transformation matrices that learn to encode the intrinsic spatial relationship between a part and a whole constitute viewpoint invariant knowledge that automatically generalizes to novel viewpoints. Hinton et al. [2011] proposed transforming autoencoders to generate the instantiation parameters of the PrimaryCapsule layer and their system required transformation matrices to be supplied externally. We propose a complete system that also answers "how larger and more complex visual entities can be recognized by using agreements of the poses predicted by active, lower-level capsules".

More broadly speaking, the benefit of being able to recognize slightly transformed viewing angles leads to dramatically fewer needed training observations that are still clearly identifiable as the same object.

So basically there's zero evidence in the paper that capsule networks require fewer training examples, correct?
I think that is correct because otherwise they would have mentioned it as an outstanding feature of the model. It does require fewer parameters than a CNN to reach the same accuracy.