by fheinsen 2428 days ago
I’m surprised you did not comment on the fact that my version of EM routing also achieves SOTA on another domain, natural language. Same code.

Here are the answers to your questions:

1. The final, published version is stamped “ICLR 2018,” so I used that year.

2. I don’t know if a conventional CNN can do this with 10x fewer parameters, while also learning to do a form of “reverse graphics” without explicitly optimizing for it. (I wouldn’t know how to get a CNN to do that without explicitly making it a training objective.)

3. IIRC, the convnet model from 2011 accepts 96x96 images. As to why Hinton et al. downsample images to 9x smaller, I suspect (but don’t know for sure) they had no choice but to do so in order to conserve memory and computation with their version of EM routing. I was able to reduce memory and computation in my variant of EM routing (by between one and two orders of magnitude) by setting the first routing layer to accept a variable number of inputs, without regard to their location in the image.

4. Me too. But you asked me about work other than Hinton’s, and that’s all I could find!

5. CIFAR10 is on the to-do list (work permitting!) :-)
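
To make point 3 concrete, here is a minimal sketch (not the paper's code) of the arithmetic behind "9x smaller" and of what accepting a variable number of inputs without regard to location could look like. It assumes "9x smaller" refers to pixel count. `make_grid` and `flatten_grid` are hypothetical helpers invented for illustration, and the depth of 4 is arbitrary.

```python
import math

def make_grid(height, width, depth):
    """Hypothetical stand-in for a feature map: one depth-dim vector per location."""
    return [[[0.0] * depth for _ in range(width)] for _ in range(height)]

def flatten_grid(grid):
    """Drop the (row, col) structure: return the vectors as one flat list."""
    return [vec for row in grid for vec in row]

full_pixels = 96 * 96                  # 9216 pixels per full-size image
small_pixels = full_pixels // 9        # 1024 pixels, assuming "9x" means pixel count
small_side = math.isqrt(small_pixels)  # 32, i.e. a 32x32 image

# The same flattening works for any image size; only the number of inputs changes.
n_full = len(flatten_grid(make_grid(96, 96, 4)))
n_small = len(flatten_grid(make_grid(small_side, small_side, 4)))
print(n_full, n_small)
```

A layer that consumes such a flat list sees 9x more inputs for the full-size image, but its own parameters need not depend on how many inputs arrive.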


How does a regular convnet do on another domain?

Learning to do “reverse graphics” is only useful if you can show it is the reason behind performance improvement, compared to a plain convnet. Until we have CIFAR-10 results, it’s not clear.

What I’m saying is - no one has yet demonstrated a clear superiority of any capsules based model to the best available plain convnet. Even on cifar-10. Looking forward to your results!

> How does a regular convnet do on another domain?

As far as I know, regular convnets have failed to outperform query-key-value self-attention models (i.e., transformers based on Vaswani et al.'s work) on pretty much every sequence task, including natural language tasks.
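
For readers unfamiliar with the term, here is a minimal sketch of single-head query-key-value self-attention in the scaled dot-product form described by Vaswani et al. It is an illustration only, unrelated to the routing algorithm under discussion. The identity projection matrices and the toy `tokens` sequence are hypothetical placeholders; in a real model the projections are learned.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence x."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(k[0]))
    # scores[i][j]: how strongly position i attends to position j
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]
    # Each output vector is a weighted average of the value vectors.
    return matmul(weights, v), weights

identity = [[1.0, 0.0], [0.0, 1.0]]            # hypothetical projections, not learned
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy sequence of three 2-d embeddings
out, weights = self_attention(tokens, identity, identity, identity)
```

Every position attends to every other position in one step, and the sequence length can vary freely because none of the weight matrices depend on it.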

> Learning to do “reverse graphics” is only useful if you can show it is the reason behind performance improvement.

I would strongly disagree. Building systems that can learn "reverse graphics" on their own has long been a goal of computer vision. It seems to be a prerequisite for building machines that can build internal representations of the state of the physical world around them. Hinton et al.'s 2018 paper has a summary of recent efforts on this front in the "Related Work" section.

> What I’m saying is - no one has yet demonstrated a clear superiority of any capsules based model to the best available plain convnet.

No one is saying otherwise. :-) Convnets are still the right tool for most production systems in visual recognition today.

That said, I don't think a convnet can achieve 99.1% accuracy on smallNORB with only 272K parameters, after training from scratch without using any additional data or metadata of any kind, as the model using my routing algorithm does. If you think you can do that with a convnet, do it and put it up online (I'd love to see it :-)

You’re comparing sentence classification done using transformer embeddings to older results which use inferior embeddings. How do regular convnets do when you feed them transformer embeddings?

Re learning reverse graphics - ok, maybe it is indeed the main feature of your work. I’d need to look into that, because from skimming your paper it’s not immediately clear what’s going on there.

Re convnet accuracy on Norb - I’m willing to make that effort for cifar-10 as soon as you have the results.

> You’re comparing sentence classification done using transformer embeddings to older results which use inferior embeddings. How do regular convnets do when you feed them transformer embeddings?

Actually, I'm comparing it to recent models, including XLNet, MT-DNN, Snorkel, and (of course) BERT. AFAIK, convnets have not been able to outperform multihead self-attention, even on pretrained embeddings.

> Re learning reverse graphics - ok, maybe it is indeed the main feature of your work. I’d need to look into that, because from skimming your paper it’s not immediately clear what’s going on there.

I agree, it's not immediately clear. Nonetheless, I find it kind of unbelievable that a model with so few parameters can seem to do it. (I was shocked when I first saw the plots.)

> Re convnet accuracy on Norb - I’m willing to make that effort for cifar-10 as soon as you have the results.

That's a little disappointing... but OK.

Thank you so much for all your questions :-)

Ah, I missed table 4 with the recent models. I looked closer and it does look impressive. However, you should ask someone who has worked on that task to review your experiments (I haven’t).

Actually, it looks like you've got a solid paper. I recommend submitting to either CVPR or ICML, especially if you can get good results on CIFAR.

Thank you!

Yes, I think this has legs.

Maximizing "bang per bit" (a) seems to be a truly new idea, as opposed to some minor tweak on the same old thing, and (b) works better than previous methods, according to the evidence so far.

(FWIW, we've been using this algorithm internally at work with similar outperformance over other methods, in yet another domain that is neither vision nor language... but I cannot share those results publicly.)

Before submitting this anywhere, I'd like to get more informal feedback from other AI researchers. I've reached out to people at Google Brain, Facebook AI, DeepMind, OpenAI, and a handful of top academic institutions and research groups. So far, the response has been positive, but I expect it will take everyone at least a couple of weeks, and probably longer, to read and understand the draft paper in sufficient detail to give me more than superficial comments.

New things often look like toys at first. :-)