Hacker News
by danharaj 3141 days ago
The problem is max pooling, a common technique, which destroys such information to gain some invariance in the representation.

Beyond the initial stages of the network, current SOTA CNNs use strided convolution in addition to (Inception, NASNet) or instead of (ResNet, DenseNet) max pooling. But my impression is that this has more to do with computational efficiency than anything else. Even with max pooling, you can maintain spatial information if you construct the preceding filters properly. But what's important in the example in the post is not the absolute locations of the parts of the face, but the spatial relationships among them, and this is actually something CNNs appear to be reasonably good at handling. CNNs achieve superhuman performance in identifying faces from natural images, so I doubt that a CNN would have trouble telling apart the faces shown in the article.

With that said, I believe that CNNs are merely one approach to understanding images that, given enough data, appears to work quite well. It is quite possible that, by encoding a stronger prior regarding the world into the network architecture, you can accomplish the same goals more accurately with less data. The appeal of the capsules work is that the approach is substantially different from the CNNs that have been tweaked to recognize images over the last 5 years, but still appears to achieve good (and sometimes superior) performance on difficult tasks.

Intuitively, this is the idea behind using genetic algorithms to encode a generative network. This gives you a species-level architecture evolved for a general class of problems, which is then optimized with a learning phase for a more specific problem.
The same problem occurs with avg pooling. Strided conv also lets you "pool" neurons in the layer below to reduce the number of neurons in subsequent layers, but, in practice, deeper neurons then also have trouble learning precise representations of the locations of the things below (though much more info is retained compared to avg/max pooling). Capsules can presumably learn such things much more accurately because they can, in principle, learn precise geometric mappings to infer positions independently of the viewpoint. However, the results so far are not much better than scalar output neurons. Capsules do perform a bit better in terms of robustness against adversarial examples and overlapping objects.
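To make the pooling point concrete, here is a minimal NumPy sketch (not from the thread; the `pool2d` helper and the 4x4 toy feature maps are assumptions made up for illustration). It shows two inputs whose single active unit sits at different positions inside the same pooling window producing identical max-pooled and avg-pooled outputs.

```python
import numpy as np

def pool2d(x, size, reduce_fn):
    """Non-overlapping pooling over size x size windows (stride == size)."""
    h, w = x.shape
    # blocks[i, a, j, b] == x[i * size + a, j * size + b]
    blocks = x.reshape(h // size, size, w // size, size)
    return reduce_fn(blocks, axis=(1, 3))

# Two toy 4x4 feature maps with one active unit each, placed in
# different corners of the same top-left 2x2 pooling window.
a = np.zeros((4, 4))
a[0, 0] = 1.0
b = np.zeros((4, 4))
b[1, 1] = 1.0

max_a, max_b = pool2d(a, 2, np.max), pool2d(b, 2, np.max)
avg_a, avg_b = pool2d(a, 2, np.mean), pool2d(b, 2, np.mean)

# The inputs differ, but after pooling the difference is gone.
print("inputs equal:     ", np.array_equal(a, b))
print("max-pooled equal: ", np.array_equal(max_a, max_b))
print("avg-pooled equal: ", np.array_equal(avg_a, avg_b))
```

A later layer only sees the pooled maps, so within a window it cannot tell these two inputs apart.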
Do you know if anyone's looked at weighted average pooling, e.g. weighted by a Gaussian centred on the middle of the receptive field? It feels like this doesn't throw away all the spatial information, but also might not be quite as hard to train as capsule networks?

There are some details I haven't thought through on this, but I'd imagine you'd want your stride length to be around the standard deviation of the Gaussian.

Any pointers to papers on this (or comments on why this obviously won't work) would be very welcome - I'm still trying to develop my intuition on all this!

You'd also lose most of the information. If there is only a single active neuron among the inputs to a Gaussian kernel neuron, you would at least have info about the distance of that to the center of the receptive field, but no directionality. If there are multiple active neurons among the inputs, you'd lose most distance-to-center info. Basically imagine avg pooling as spatial downsampling by box filter or surface area integration, and Gaussian pooling as downsampling by Gaussian filtering.
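A small NumPy sketch of that argument (an illustration under assumed values: a 5x5 receptive field, sigma = 1, and hypothetical helper names). A single Gaussian pooling unit responds identically to one active input at the same distance from the centre in any direction, and with two active inputs the response no longer pins down distance either.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """size x size Gaussian weights centred on the middle of the window."""
    r = np.arange(size) - (size - 1) / 2.0
    yy, xx = np.meshgrid(r, r, indexing="ij")
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def gaussian_pool(window, kernel):
    """One Gaussian pooling unit: a fixed weighted sum over its receptive field."""
    return float((window * kernel).sum())

def one_hot(row, col, size=5):
    """Receptive field with a single active input at (row, col)."""
    w = np.zeros((size, size))
    w[row, col] = 1.0
    return w

k = gaussian_kernel(5, sigma=1.0)

# One active input, two steps from the centre, in three different directions.
r_left = gaussian_pool(one_hot(2, 0), k)
r_right = gaussian_pool(one_hot(2, 4), k)
r_above = gaussian_pool(one_hot(0, 2), k)
# One active input, one step from the centre.
r_near = gaussian_pool(one_hot(2, 1), k)
# Two active inputs versus one input that is twice as active.
r_pair = gaussian_pool(one_hot(2, 0) + one_hot(2, 4), k)
r_double = gaussian_pool(2.0 * one_hot(2, 0), k)

print("same distance, different direction:", r_left, r_right, r_above)
print("closer input gives a larger response:", r_near > r_left)
print("two inputs vs. one stronger input:", r_pair, r_double)
```

The single-input case keeps distance (given a known activation strength) but not direction; the two-input case is indistinguishable from one stronger input.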
Thanks! I agree with the intuition around spatial downsampling.

I was thinking that the next layer in the network would respond to multiple samples (i.e. convolutions of the Gaussian at different positions) and, as long as you didn't have too many active neurons on the previous layer, it could extract a measure of position.
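As a rough 1D sketch of this idea (all values below are assumptions chosen for illustration, including a stride equal to sigma): with a single active unit at position x, two Gaussian pooling units centred at c1 and c2 respond with r_i = a * exp(-(x - c_i)^2 / (2 sigma^2)), so the log-ratio of the responses can be inverted for x and the unknown activation a cancels. With a second active unit, the same inversion gives a wrong position.

```python
import numpy as np

def gaussian_response(positions, amplitudes, centre, sigma):
    """Response of one Gaussian pooling unit to a set of active inputs (1D)."""
    positions = np.asarray(positions, dtype=float)
    amplitudes = np.asarray(amplitudes, dtype=float)
    weights = np.exp(-(positions - centre) ** 2 / (2.0 * sigma ** 2))
    return float(np.sum(amplitudes * weights))

def position_from_two_responses(r1, r2, c1, c2, sigma):
    """Invert log(r1 / r2) for x, assuming exactly one active input."""
    return (c1 + c2) / 2.0 + sigma ** 2 * np.log(r1 / r2) / (c1 - c2)

sigma = 2.0
c1, c2 = 0.0, 2.0          # neighbouring pooling units, stride == sigma

# Case 1: a single active input at x = 0.7 with arbitrary activation 3.0.
r1 = gaussian_response([0.7], [3.0], c1, sigma)
r2 = gaussian_response([0.7], [3.0], c2, sigma)
x_single = position_from_two_responses(r1, r2, c1, c2, sigma)

# Case 2: a second active input at x = 5.0 breaks the assumption.
r1 = gaussian_response([0.7, 5.0], [1.0, 1.0], c1, sigma)
r2 = gaussian_response([0.7, 5.0], [1.0, 1.0], c2, sigma)
x_crowded = position_from_two_responses(r1, r2, c1, c2, sigma)

print("one active input, recovered position: ", x_single)
print("two active inputs, recovered position:", x_crowded)
```

This only shows that the information is recoverable in the sparse case; whether a trained next layer would actually learn such a readout is a separate question.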

If you have too many active neurons then, as you say, you encounter aliasing effects, but I think the same is true with capsule networks - they're not expected to handle particularly high-frequency features, are they?

Either way, thanks for your comment!

> I was thinking that the next layer in the network would respond to multiple samples (i.e. convolutions of the Gaussian at different positions) and, as long as you didn't have too many active neurons on the previous layer, it could extract a measure of position.

I think it would still be a very "blunt tool" for feature detection. If you are going to compute weighted sums in a convolution anyway (as opposed to just summation in avg pooling or maximum search in max pooling), then the question is really why not simply learn arbitrary feature detectors instead of fixed Gaussian kernels? You can separate Gaussian kernels in the x and y directions, which allows you to compute them in 2 * N^2 * K + N^2 instead of N^2 * K^2 operations (with image size N and kernel size K), but in practice, that probably won't give you enough improvement to make up for how few bits of information a Gaussian filter can extract. You would also need to use a very strong sparsity regularizer to get few enough active neurons in the previous layer such that multiple Gaussians can infer a location. I am not entirely sure it would not work; maybe it is worth a try.
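The separability claim can be checked with a short NumPy sketch (helper names, the 32x32 random image, and K = 5 are assumptions for illustration). A 2D Gaussian kernel is the outer product of two 1D Gaussians, so one horizontal pass followed by one vertical pass gives the same result as the full 2D filter. The operation counts printed at the end simply evaluate the two expressions quoted in the comment above.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def gaussian_1d(K, sigma):
    """Normalised 1D Gaussian of length K."""
    r = np.arange(K) - (K - 1) / 2.0
    g = np.exp(-r ** 2 / (2.0 * sigma ** 2))
    return g / g.sum()

def filter_2d_direct(img, k2d):
    """'Valid' 2D filtering: K * K multiply-adds per output pixel."""
    windows = sliding_window_view(img, k2d.shape)   # (H-K+1, W-K+1, K, K)
    return np.einsum("ijkl,kl->ij", windows, k2d)

def filter_2d_separable(img, k1d):
    """Same filter as a horizontal pass followed by a vertical pass."""
    K = k1d.size
    rows = sliding_window_view(img, K, axis=1) @ k1d    # (H, W-K+1)
    return sliding_window_view(rows, K, axis=0) @ k1d   # (H-K+1, W-K+1)

N, K = 32, 5
rng = np.random.default_rng(0)
img = rng.random((N, N))

k1d = gaussian_1d(K, sigma=1.0)
k2d = np.outer(k1d, k1d)            # the full 2D Gaussian kernel

direct = filter_2d_direct(img, k2d)
separable = filter_2d_separable(img, k1d)
print("results match:", np.allclose(direct, separable))

# Operation counts as given in the comment (N = image size, K = kernel size).
print("direct:   ", N ** 2 * K ** 2)
print("separable:", 2 * N ** 2 * K + N ** 2)
```

Note this works only because the Gaussian factorises; a learned K x K filter generally does not, which is part of the trade-off being discussed.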

> If you have too many active neurons then, as you say, you encounter aliasing effects, but I think the same is true with capsule networks - they're not expected to handle particularly high-frequency features, are they?

That is a very good point. In neuro lingo, this aliasing is called "crowding". A multi-channel filter kernel (as in standard CNNs) can in principle deal with that by learning filters for representing multiple entities in different spatial configurations within the receptive field, but that requires a large number of filters and spatial codes, which are also not very trainable in CNNs. Capsules can indeed only represent one entity within their respective receptive fields. I think capsules fail more gracefully in case of crowding than standard CNNs because the agreement detection can decide on one out of multiple objects being predicted by the capsules below.
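For readers who have not seen the agreement step, below is a simplified sketch in the spirit of the routing-by-agreement procedure from the dynamic routing paper. It is not the authors' code: the votes `u_hat` are hand-made toy values rather than outputs of learned transformation matrices, and the sizes (6 lower capsules, 2 upper capsules, 4-dimensional poses) are arbitrary assumptions. Four lower capsules cast identical votes for upper capsule 0, while every other set of votes cancels out.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Shrink vector length into [0, 1) while keeping direction."""
    sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (sq / (1.0 + sq)) * s / (np.sqrt(sq) + eps)

def route(u_hat, iterations=3):
    """u_hat[i, j] is lower capsule i's vote for the pose of upper capsule j."""
    n_lower, n_upper, _ = u_hat.shape
    b = np.zeros((n_lower, n_upper))                  # routing logits
    for _ in range(iterations):
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)          # softmax over upper capsules
        s = np.einsum("ij,ijd->jd", c, u_hat)         # weighted sum of votes
        v = squash(s)                                 # upper capsule outputs
        b = b + np.einsum("ijd,jd->ij", u_hat, v)     # reward agreeing votes
    return v, c

# Toy votes (hypothetical values).
u_hat = np.zeros((6, 2, 4))
u_hat[:4, 0] = [1, 0, 0, 0]       # capsules 0-3 agree about upper capsule 0
u_hat[4, 0] = [0, 1, 0, 0]        # capsules 4 and 5 disagree with each other
u_hat[5, 0] = [0, -1, 0, 0]
u_hat[0, 1] = [1, 0, 0, 0]        # votes for upper capsule 1 all conflict
u_hat[1, 1] = [-1, 0, 0, 0]
u_hat[2, 1] = [0, 1, 0, 0]
u_hat[3, 1] = [0, -1, 0, 0]
u_hat[4, 1] = [0, 0, 1, 0]
u_hat[5, 1] = [0, 0, -1, 0]

v, c = route(u_hat)
print("coupling to upper capsule 0:", c[:, 0])
print("upper capsule lengths:", np.linalg.norm(v, axis=1))
```

In this toy case the agreeing capsules shift their coupling toward upper capsule 0 and the conflicting votes produce a near-zero output, which is the "decide on one object" behaviour described above.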

Thanks for your very informative reply - definitely more for me to read up on!
The major advantage proposed for capsule networks is the ability to train on far fewer observations, not necessarily higher accuracy. At this point CNNs are consistently approaching or even exceeding human levels of accuracy, and thus benefit from a slower but more accurate methodology that relies on far more training data.
Skimming the two papers I could not find any figure about data efficiency. Did you?
The specific instance I was remembering was from interviews Hinton's given about these papers, but this is the section of the arXiv paper that's relevant:

>Now that convolutional neural networks have become the dominant approach to object recognition, it makes sense to ask whether there are any exponential inefficiencies that may lead to their demise. A good candidate is the difficulty that convolutional nets have in generalizing to novel viewpoints. The ability to deal with translation is built in, but for the other dimensions of an affine transformation we have to chose between replicating feature detectors on a grid that grows exponentially with the number of dimensions, or increasing the size of the labelled training set in a similarly exponential way. Capsules (Hinton et al. [2011]) avoid these exponential inefficiencies by converting pixel intensities into vectors of instantiation parameters of recognized fragments and then applying transformation matrices to the fragments to predict the instantiation parameters of larger fragments. Transformation matrices that learn to encode the intrinsic spatial relationship between a part and a whole constitute viewpoint invariant knowledge that automatically generalizes to novel viewpoints. Hinton et al. [2011] proposed transforming autoencoders to generate the instantiation parameters of the PrimaryCapsule layer and their system required transformation matrices to be supplied externally. We propose a complete system that also answers "how larger and more complex visual entities can be recognized by using agreements of the poses predicted by active, lower-level capsules".

More broadly speaking, being able to recognize an object from slightly transformed viewing angles, where it is still clearly identifiable as the same object, should dramatically reduce the number of training observations needed.
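The "viewpoint invariant knowledge" in the quoted passage can be illustrated with a toy 2D example (a sketch only: the matrices are hand-set rather than learned, and the face/nose poses and the `affine` helper are hypothetical). If the pose of a part is the pose of the whole composed with a fixed part-in-whole transform, then one fixed matrix W maps the part's pose to the whole's pose under every viewpoint.

```python
import numpy as np

def affine(angle, tx, ty, scale=1.0):
    """3x3 homogeneous matrix for a 2D rotation, uniform scale and translation."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[scale * c, -scale * s, tx],
                     [scale * s,  scale * c, ty],
                     [0.0,        0.0,       1.0]])

# Hypothetical intrinsic relationship: where the nose sits within the face.
nose_in_face = affine(0.1, 0.0, -0.5, scale=0.3)

# Part-to-whole matrix: maps a nose pose to the face pose it implies.
# In a capsule network this would be learned; here it is computed directly.
W = np.linalg.inv(nose_in_face)

# Three different viewpoints (camera transforms applied to the whole face).
viewpoints = [affine(0.0, 0.0, 0.0),
              affine(0.8, 3.0, -1.0, scale=2.0),
              affine(-2.1, -4.0, 5.0, scale=0.5)]

errors = []
for view in viewpoints:
    face_pose = view                      # pose of the whole in the image
    nose_pose = face_pose @ nose_in_face  # pose of the part in the image
    predicted_face = nose_pose @ W        # same W regardless of viewpoint
    errors.append(float(np.max(np.abs(predicted_face - face_pose))))

print("max prediction error per viewpoint:", errors)
```

The viewpoint changes both poses, but W does not depend on it, so nothing new has to be learned for an unseen viewpoint. This says nothing by itself about how many training examples are needed in practice.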

So basically there's zero evidence in the paper that capsule networks require fewer training examples, correct?
I think that is correct because otherwise they would have mentioned it as an outstanding feature of the model. It does require fewer parameters than a CNN to reach the same accuracy.