by keithyjohnson 2461 days ago
Understanding a sentence is fundamentally different from recognizing an object. But people are trying to use deep learning to do both.

I agree with most of the article, but I think this^^ skips over the different types of networks used to solve perception and language problems. A CNN is very different from, say, word2vec, which isn't a very deep network at all.


I’d go further and say that deep networks are excellent for sentence understanding, and various types of RNN or 1D convolutional layers are very good at this in specialized domains just as CNNs and ResNets are good in specialized vision applications.

It absolutely makes sense to use deep learning for both of these tasks.

In fact, one very effective thing to do is to use a Siamese network to learn joint representational spaces of text and imagery in the same network.

It’s really specious and disingenuous to say “boy, vision and language sure seem different but can you believe these DL researchers are using the same tools for both!?”

or... "vision and language sure seem different, can you believe that networks of neurons in the brain do both?"
Perhaps the difference is in the nature of the information that is being probed and its larger context? Visual imagery often provides almost all of its own context, but the “meaning” of a sentence can be radically different depending upon its source. Humans produce words, so you almost need a working theory of mind to fully understand them. None of that context will ever make it into word2vec.
Why not? It's true that the word "space" has different meanings when it appears in a math book, a CS book, or an astronomy book. But we just have 3 different word2vec models. When I read something about math, I pick the math word2vec model, and there "space" appears close to the words "Hilbert" and "separable", while in the CS model the same word is next to "complexity" and "memory". As I read more, I improve my word2vec models, but never mix them together. Now what happens if I'm reading something and don't understand the context? No, I don't switch to some general word2vec model. Rather, I try to guess which model to use and then reread the same text using that model.
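
To make the "one model per domain, guess which one applies" idea concrete, here is a rough sketch in plain Python. Everything in it is an assumption for illustration: the 2-D vectors are invented by hand rather than trained with word2vec, and `nearest` and `guess_domain` are hypothetical helper names. The domain guess is just a count of context words that appear in each model's vocabulary, which is one crude way to do it and not a claim about how it should be done.

```python
import math

# Toy 2-D "embeddings", one table per domain. The numbers are made up for
# illustration; real vectors would come from word2vec models trained
# separately on each domain's corpus.
MODELS = {
    "math": {
        "space": (0.9, 0.1),
        "hilbert": (0.85, 0.2),
        "separable": (0.8, 0.15),
        "norm": (0.1, 0.9),
    },
    "cs": {
        "space": (0.2, 0.9),
        "complexity": (0.25, 0.85),
        "memory": (0.15, 0.95),
        "compiler": (0.9, 0.1),
    },
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def nearest(model, word, k=2):
    """Return the k words in `model` most similar to `word`."""
    target = model[word]
    scored = [(w, cosine(target, vec)) for w, vec in model.items() if w != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [w for w, _ in scored[:k]]


def guess_domain(text, models):
    """Pick the domain whose vocabulary covers the most tokens in `text`."""
    tokens = text.lower().split()

    def hits(name):
        return sum(1 for t in tokens if t in models[name])

    return max(models, key=hits)


if __name__ == "__main__":
    for sentence in ("a separable hilbert space", "space and memory complexity"):
        domain = guess_domain(sentence, MODELS)
        print(sentence, "->", domain, nearest(MODELS[domain], "space"))
```

The same word "space" gets different neighbours depending on which table is picked, and the tables are never merged, which is the point of the comment above.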