| > How does a regular convnet do on another domain? As far as I know, regular convnets have failed to outperform query-key-value self-attention models (i.e., transformers based on Vaswani et al.'s work) on pretty much every sequence task, including natural language tasks. > Learning to do “reverse graphics” is only useful if you can show it is the reason behind performance improvement. I would strongly disagree. Building systems that can learn "reverse graphics" on their own has long been a goal of computer vision. It seems to be a prerequisite for building machines that can form internal representations of the state of the physical world around them. Hinton et al.'s 2018 paper has a summary of recent efforts on this front in the "Related Work" section. > What I’m saying is - no one has yet demonstrated a clear superiority of any capsules based model to the best available plain convnet. No one is saying otherwise. :-) Convnets are still the right tool for most production systems in visual recognition today. That said, I don't think a convnet can achieve 99.1% accuracy on smallNORB with only 272K parameters, after training from scratch without using any additional data or metadata of any kind -- as the model using my routing algorithm does. If you think you can do that with a convnet, do it and put it up online (I'd love to see it :-) |
Re learning reverse graphics - ok, maybe it is indeed the main feature of your work. I’d need to look into that, because from skimming your paper it’s not immediately clear what’s going on there.
Re convnet accuracy on smallNORB - I’m willing to make that effort for CIFAR-10 as soon as you have results on it.