by boxy310
3141 days ago
The specific instance I was remembering was from interviews Hinton's given about these papers, but this is the section of the arXiv paper that's relevant:

> Now that convolutional neural networks have become the dominant approach to object recognition, it
makes sense to ask whether there are any exponential inefficiencies that may lead to their demise. A
good candidate is the difficulty that convolutional nets have in generalizing to novel viewpoints. The
ability to deal with translation is built in, but for the other dimensions of an affine transformation
we have to choose between replicating feature detectors on a grid that grows exponentially with the
number of dimensions, or increasing the size of the labelled training set in a similarly exponential way.
Capsules (Hinton et al. [2011]) avoid these exponential inefficiencies by converting pixel intensities into vectors of instantiation parameters of recognized fragments and then applying transformation
matrices to the fragments to predict the instantiation parameters of larger fragments. Transformation
matrices that learn to encode the intrinsic spatial relationship between a part and a whole constitute
viewpoint invariant knowledge that automatically generalizes to novel viewpoints. Hinton et al. [2011]
proposed transforming autoencoders to generate the instantiation parameters of the PrimaryCapsule
layer and their system required transformation matrices to be supplied externally. We propose a
complete system that also answers "how larger and more complex visual entities can be recognized
by using agreements of the poses predicted by active, lower-level capsules".

More broadly speaking, being able to recognize an object from slightly transformed viewing angles means you need dramatically fewer training observations, since those transformed views are still clearly identifiable as the same object.
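To make the "agreement of predicted poses" idea a bit more concrete, here's a rough NumPy sketch. It is not the paper's routing algorithm, and nothing is learned here: the part-to-whole matrices are set by hand for a made-up "face" object, and all the names (`pose`, `PART_IN_WHOLE`, `vote_spread`, and so on) are mine, purely for illustration. The only point is that a fixed part-to-whole matrix keeps producing consistent votes when the viewpoint changes, while a jumbled arrangement of the same parts does not.

```python
import numpy as np

def pose(angle, scale, tx, ty):
    """3x3 homogeneous 2D pose: rotation + uniform scale + translation."""
    c, s = scale * np.cos(angle), scale * np.sin(angle)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0.0, 0.0, 1.0]])

# Hypothetical object: pose of each part expressed in the whole's own frame.
PART_IN_WHOLE = {
    "left_eye":  pose(0.0, 0.3, -1.0,  1.0),
    "right_eye": pose(0.0, 0.3,  1.0,  1.0),
    "mouth":     pose(0.0, 0.5,  0.0, -1.0),
}

# Stand-in for the transformation matrices a capsule network would learn:
# each maps a part's observed pose to a vote for the whole's pose.
# They depend only on the part/whole relationship, not on the viewpoint.
PART_TO_WHOLE = {name: np.linalg.inv(m) for name, m in PART_IN_WHOLE.items()}

def observe_parts(whole_pose):
    """Poses of the parts in the image when the whole sits at whole_pose."""
    return {name: whole_pose @ m for name, m in PART_IN_WHOLE.items()}

def votes_for_whole(observed):
    """Each part votes for the whole's pose using its fixed matrix."""
    return {name: observed[name] @ PART_TO_WHOLE[name] for name in observed}

def vote_spread(votes):
    """Largest deviation of any vote entry from the mean vote (0 = agreement)."""
    stacked = np.stack(list(votes.values()))
    return float(np.abs(stacked - stacked.mean(axis=0)).max())

view_a = pose(0.0, 1.0, 0.0, 0.0)    # canonical viewpoint
view_b = pose(0.7, 1.8, 5.0, -2.0)   # rotated, scaled, shifted viewpoint

for label, view in (("view_a", view_a), ("view_b", view_b)):
    spread = vote_spread(votes_for_whole(observe_parts(view)))
    print(label, "vote spread:", spread)

# Same parts, wrong arrangement: mouth and left eye swap places.
jumbled = observe_parts(view_b)
jumbled["mouth"], jumbled["left_eye"] = jumbled["left_eye"], jumbled["mouth"]
print("jumbled vote spread:", vote_spread(votes_for_whole(jumbled)))
```

The votes agree (up to floating-point error) for both viewpoints without the matrices ever seeing `view_b`, which is the generalization the quoted passage is describing. The jumbled case is where the votes disagree, so a higher-level capsule would have no reason to become active.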