by ergodic 4838 days ago
I skimmed it, but the paper seems to use the same DNN architecture as before. They seem to tweak the pre-training with layer-wise back-propagation (instead of full MLP-as-DBN pre-training). This does not add anything new with respect to what I commented and the cited paper.

The only reference to differences I found is about differences between a DNN and a MaxEnt model, which again is not an argument for differences between DNNs and MLPs.

Could you point me to a concrete paragraph? I would be happy to be proven wrong on this.

2 comments

DNNs can be thought of as stacked Restricted Boltzmann Machines. Their structure and training is very different to traditional MLPs. They derive in some ways from convolutional neural nets.

I describe some of the key differences between DNNs and MLPs in the webinar. Also, the webinar explains how recent advances go far beyond just applications to speech recognition - in particular I focus on a case study in chemoinformatics.

>DNNs can be thought of as stacked Restricted Boltzmann Machines

Agreed, as explained in Hinton et al. 2006.

http://www.cs.toronto.edu/~hinton/absps/ncfast.pdf

But this is just for pre-training, as I said. If you look at Seide's paper, they pre-train treating the MLP as a DBN and then train it as a classic MLP with BP. Also, layer-wise BP pre-training brings performance close to that of DBN pre-training, with no use of DBN paradigms at all.
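
To make "layer-wise BP pre-training" concrete, here is a minimal pure-Python sketch of one possible reading: train a network with one hidden layer by back-propagation, then repeatedly add a hidden layer under a fresh output layer and train again, and finally fine-tune the whole stack. It is not the paper's exact recipe. The function names, the sigmoid units with squared error, the XOR toy data, and the hyperparameters are all illustrative assumptions.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def new_layer(n_in, n_out, rng):
    # Each row holds one unit's weights plus a trailing bias weight.
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]

def forward(layers, x):
    # Returns the activations of every layer, with the input as acts[0].
    acts = [list(x)]
    for layer in layers:
        prev = acts[-1] + [1.0]  # constant 1.0 feeds the bias weight
        acts.append([sigmoid(sum(w * v for w, v in zip(unit, prev))) for unit in layer])
    return acts

def train(layers, data, epochs, lr):
    # Plain stochastic back-propagation (squared error), updating all layers in place.
    for _ in range(epochs):
        for x, target in data:
            acts = forward(layers, x)
            delta = [(o - t) * o * (1 - o) for o, t in zip(acts[-1], target)]
            for i in reversed(range(len(layers))):
                prev = acts[i] + [1.0]
                if i > 0:
                    # Propagate the error through the weights before they are updated.
                    back = [
                        sum(layers[i][k][j] * delta[k] for k in range(len(delta)))
                        * acts[i][j] * (1 - acts[i][j])
                        for j in range(len(acts[i]))
                    ]
                for k, unit in enumerate(layers[i]):
                    for j in range(len(unit)):
                        unit[j] -= lr * delta[k] * prev[j]
                if i > 0:
                    delta = back

def layerwise_bp(data, n_in, hidden_sizes, n_out, rng, epochs=200, lr=0.5):
    # Hypothetical helper: grow the network one hidden layer at a time.
    hidden, out, size = [], None, n_in
    for h in hidden_sizes:
        hidden.append(new_layer(size, h, rng))
        out = new_layer(h, n_out, rng)  # fresh output layer at each depth
        train(hidden + [out], data, epochs, lr)
        size = h
    net = hidden + [out]
    train(net, data, epochs, lr)  # final fine-tuning as an ordinary MLP
    return net

rng = random.Random(0)
xor = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
       ([1.0, 0.0], [1.0]), ([1.0, 1.0], [0.0])]
net = layerwise_bp(xor, n_in=2, hidden_sizes=[4, 4], n_out=1, rng=rng)
predictions = [forward(net, x)[-1][0] for x, _ in xor]
```

Note that nothing in this sketch uses an RBM or any generative step: every stage is ordinary back-propagation on an MLP, which is the point being made above.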

>Their structure and training is very different to traditional MLPs

I insist: if we are talking about the same DNNs explained in Microsoft's paper, this is not true. If we are talking about different DNNs, please elaborate; I would love to hear about that (seriously, no irony here).

There's also the random knockout of neurons, as mentioned in the webinar.

I did not find that in the paper. Are you referring to randomly switching off neurons? I would be surprised if this were not already a technique from the original wave of neural network research.
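
If "random knockout" means randomly switching off hidden units during training (the technique usually called dropout, which is an assumption about what the webinar describes), the core operation can be sketched in a few lines. The function name, the rescaling choice ("inverted dropout"), and the example values are illustrative, not taken from the paper or the webinar.

```python
import random

def dropout(activations, p_drop, rng):
    """Zero each activation with probability p_drop and rescale the survivors."""
    if not 0.0 <= p_drop < 1.0:
        raise ValueError("p_drop must be in [0, 1)")
    keep = 1.0 - p_drop
    # Dividing by `keep` leaves the expected activation unchanged,
    # so no rescaling is needed when the full network is used at test time.
    return [a / keep if rng.random() >= p_drop else 0.0 for a in activations]

rng = random.Random(0)
hidden = [0.5, 1.0, 1.5, 2.0]
dropped = dropout(hidden, p_drop=0.5, rng=rng)
```

A new random mask is drawn for every training example; at test time the function is simply not applied.
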
Compared to older MLP research, besides the new training algorithm, there is the new insight that the deep structure of the network might be efficient at generating very good encodings of the input variables, as described here:

http://en.wikipedia.org/wiki/Autoencoder

I am not very familiar with speech recognition, but I think what they talk about here:

Instead of factorizing the networks, e.g., into a monophone and a context-dependent part [5], or decomposing them hierarchically [6], CD-DNN-HMMs directly model tied context-dependent states (senones). This had long been considered ineffective, until [1] showed that it works and yields large error reductions for deep networks.

might be related to this fact. 20 years ago it wasn't known why you would pick a deep network instead of a shallow one. There was even this famous theorem of Kolmogorov, which a lot of people in ML misunderstood, that a network with just one hidden layer can in theory learn any function with arbitrary precision.

Again, the use of senones instead of monophones or diphones is just a change of output targets, not a novelty per se.