Hacker News new | ask | show | jobs
by maffydub 3141 days ago
Thanks! I agreed with the intuition around spatial downsampling.

I was thinking that the next layer in the network would respond to multiple samples (i.e. convolutions of the Gaussian at different positions) and, as long as you didn't have too many active neurons on the previous layer, it could extract a measure of position.

If you have too many active neurons then, as you say, you encounter aliasing effects, but I think the same is true with capsule networks - they're not expected to handle particularly high-frequency features, are they?

Either way, thanks for your comment!

1 comments

> I was thinking that the next layer in the network would respond to multiple samples (i.e. convolutions of the Gaussian at different positions) and, as long as you didn't have too many active neurons on the previous layer, it could extract a measure of position.

I think, it would still be a very "blunt tool" for feature detection. If you are going to compute weighted sums in a convolution anyway (as opposed to just summation in avg pooling or maximum search in max pooling), then the question is really why not simply learn arbitrary feature detectors instead of fixed Gaussian kernels? You can separate Gaussian kernels in x and y direction, which allows you to compute it in 2 * N^2 * K + N^2 instead of N^2 * K^2 operations (with image size N and kernel size K), but in practice, that probably won't give you enough improvement to make up for how few bits of information a Gaussian filter can extract. You would also need to use a very strong sparsity regularizer to get few enough active neurons in the previous layer such that that multiple Gaussians can infer a location. I am not entirely sure it would not work, maybe it is worth a try.

> If you have too many active neurons then, as you say, you encounter aliasing effects, but I think the same is true with capsule networks - they're not expected to handle particularly high-frequency features, are they?

That is a very good point. In neuro lingo, this aliasing is called "crowding". A multi-channel filter kernel (as in standard CNNs) can in principle deal with that by learning filters for representing multiple entities in different spatial configurations within the receptive field, but that requires large amounts of filters and spatial codes which are also not trainable very well in CNNs. Capsules can indeed only represent one entity within their respective receptive fields. I think, capsules fail more gracefully in case of crowding than standard CNNs because the agreement detection can decide on one out of multiple objects being predicted by the capsules below.

Thanks for your very informative reply - definitely more for me to read up on!