| I’m surprised you did not comment on the fact that my version of EM routing also achieves SOTA in another domain, natural language, with the same code. Here are the answers to your questions: 1. The final, published version is stamped “ICLR 2018,” so I used that year. 2. I don’t know if a conventional CNN can do this with 10x fewer parameters, while also learning to do a form of “reverse graphics” without explicitly optimizing for it. (I wouldn’t know how to get a CNN to do that without explicitly making it a training objective.) 3. IIRC, the convnet model from 2011 accepts 96x96 images. As to why Hinton et al. downsample images to be 9x smaller, I suspect (but don’t know for sure) that they had no choice but to do so, in order to conserve memory and computation with their version of EM routing. I was able to reduce memory and computation with my variant of EM routing (by between one and two orders of magnitude) by setting the first routing layer to accept a variable number of inputs, without regard to their location in the image. 4. Me too. But you asked me about work other than Hinton’s, and that’s all I could find! 5. CIFAR10 is on the to-do list (work permitting!) :-) |
Learning to do “reverse graphics” is only useful if you can show that it is the reason behind the performance improvement over a plain convnet. Until we have CIFAR-10 results, that’s not clear.
What I’m saying is that no one has yet demonstrated a clear superiority of any capsule-based model over the best available plain convnet, not even on CIFAR-10. Looking forward to your results!