Hacker News
by ergodic 4838 days ago
Well, it is definitely something but it being the "Breakthrough of the Decade" seems pretty unlikely to me (given my available evidence).

I do not know other examples well beyond the case of Automatic Speech Recognition (ASR), but since this case caused a lot of noise, I bet it is responsible for a reasonable chunk of the deep learning "buzz". Here is my take on it.

If you look at papers from Microsoft like Seide et al 2011 and similar work, the reported improvement over the state of the art (up to 30%) is really impressive and seems solid. Now, the technique is more or less a very big multi-layer perceptron (MLP), a technique already established two decades ago (or more). There is some fancy stuff like the deep belief network based initialization, but it does not make a big difference. The core of the recipe itself is not very new. What has changed is the scale of data we have available and the size of the models that we can handle.
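To make the "very big MLP" point concrete, here is a minimal sketch of a forward pass through a multi-layer perceptron. It is not the architecture from Seide et al; the layer sizes, the sigmoid on every layer and the names `make_layer` and `forward` are assumptions chosen for illustration. The only thing that makes the network "deep" here is the length of the `sizes` list.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def make_layer(n_in, n_out, rng):
    """Random weights (n_out rows of n_in values) plus zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(layers, x):
    """Feed x through every layer: affine transform followed by a sigmoid."""
    activation = x
    for weights, biases in layers:
        activation = [
            sigmoid(sum(w * a for w, a in zip(row, activation)) + b)
            for row, b in zip(weights, biases)
        ]
    return activation

rng = random.Random(0)
# Hypothetical sizes: 4 inputs, three hidden layers of 8 units, 3 outputs.
sizes = [4, 8, 8, 8, 3]
layers = [make_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
output = forward(layers, [0.1, -0.2, 0.3, 0.4])
```

Making the network deeper or wider only means changing `sizes`; a real ASR system would be far larger and would typically end in a softmax rather than a sigmoid.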

With this I am not implying that this is not a very interesting discovery. But it is important to bear in mind that the change in the amount of data could also make other 20 year old techniques interesting again. On the other hand, neural networks have had a bad name in recent years for understandable reasons. They are a black box, or at least less transparent than the statistical methods. This makes them prone to cause the "black box delusion" effect. You hear a new algorithm is in town, it has fancy stuff like architectures remotely resembling human thinking or cool math, but you cannot completely grasp its guts, and then "voilà!", suddenly you are overestimating its relevance and scope of applicability. MLPs were already hailed as "the" tool for machine learning once, I think for these same reasons. For me the right position here is prudent skepticism.

On the other hand, this should also push people to try new/old radical stuff. Since the rules of the game seem to be changing, this is not a moment to be conservative in ML research :).

4 comments

I've heard this argument ever since Norvig's Unreasonable Effectiveness of Data. While having a ton of data available is great, it has its limits. I believe you are overestimating the effectiveness of data (as, imo, Norvig did). And in this specific case, a large dataset is not what is behind the hype:

from the NYT article [1]: "The achievement was particularly impressive because the team decided to enter the contest at the last minute and designed its software with no specific knowledge about how the molecules bind to their targets. The students were also working with a relatively small set of data; neural nets typically perform well only with very large ones."

NNs in general have enjoyed lots of successful practical (commercial) applications in pattern recognition, though in many cases they were more or less replaced as the "state-of-the-art" by SVMs until RBMs and DBNs came along. I agree with your call for skepticism though; only time will tell how good DBNs are.

I think the black box criticism is BS for the most part. In some cases (Google's search being a famous example) it might be great to have a human readable and tweakable solution (assuming you have the resources) but for something like recognising handwritten digits from images, not so much.

[1] http://www.nytimes.com/2012/11/24/science/scientists-see-adv...

Regarding the black box criticism, it seems to me that most popular algorithms (SVM, Random forest, ...) become black boxes once you go past the simple 2D example and apply them to real problems. Real-world decision trees are pretty unreadable and include some rules that really don't make more sense than the weights in a neural network.
> it might be great to have a human readable and tweakable solution (assuming you have the resources) but for something like recognising handwritten digits from images, not so much.

Agreed, but by black box I did not mean something that is opaque to my grandmother; I meant something partially opaque to the engineers who implement MLP machine learning applications and to the tech lead who makes the decisions. The thing is that even research people (or maybe especially them) tend to be positively biased towards things they do not completely understand (so I think, maybe it's just me ;)). That is what I meant by black-box delusion. As you say, only time will tell.

Regarding DBNs, again, the ASR case uses DNNs, which is to say big fat MLPs. The model is handled as a DBN only for pre-training, and layer-wise pre-training does a similar job anyway.

Regarding the "black-box delusion", it's not just you. You see a magician do a trick, and it's amazing. Then he explains how it is done, and the excitement vanishes. Oh, that's all it is, no big deal.

Any sufficiently advanced technology is indistinguishable from magic, and who knows what wonders magic might accomplish? But once you understand the "trick", it's obvious that it can't do much more than what it's doing. Oh, well. The magic is gone.

I have no idea if it is the breakthrough of the decade, but I think deep learning isn't just taking a perceptron with many hidden layers and applying backpropagation to it, as you seem to say. All the interesting things about it you summarized as "fancy stuff" and "not making a big difference", without any context, references or arguments. I do not feel competent to discuss it as I have very little experience in this field, but your summary doesn't feel too informed even given whatever little knowledge I have. Certainly faster computers and more data have helped, but just like in traditional algorithms research, they cannot completely make up for exponential growth in the computation or the amount of data required. There have been large improvements in both respects in the deep learning community; in fact, the term "deep learning" rarely refers in practice to the traditional, completely supervised learning that you are talking about.

There are nice and more balanced overviews here:

http://ufldl.stanford.edu/wiki/index.php/Deep_Networks:_Over...

http://en.wikipedia.org/wiki/Deep_learning

If it was not clear enough, "fancy stuff" and "not making a big difference" refer to Seide et al 2011, mentioned in the same paragraph. Table 2 is particularly revealing in this regard.

http://research.microsoft.com/apps/pubs/default.aspx?id=1531...

As I said, I can only speak with more or less certainty regarding ASR. I am fairly sure that the success in ASR (with Google and MS embracing DNNs for ASR) contributed significantly to the mainstream impact of deep learning.

There is a second paper where they specifically point out the differences between their approach and previous approaches using neural networks and it isn't only the number of layers that has changed but also the internal architecture of the network, the "responsibilities" of the layers, so again, it isn't just a traditionally trained MLP with a lot of layers:

http://research.microsoft.com/pubs/157341/FeatureEngineering...

I only skimmed it, but the paper seems to use the same DNN architecture as before. They seem to tweak the pre-training with layer-wise back-propagation (instead of full MLP-as-DBN pre-training). This does not imply anything new with respect to what I commented and the cited paper.

The only reference to differences I found is about differences between a DNN and a MaxEnt model, which is again not an argument for differences between DNNs and MLPs.

Could you point me to a concrete paragraph? I would be happy to be mistaken in this regard.

DNNs can be thought of as stacked Restricted Boltzmann Machines. Their structure and training is very different to traditional MLPs. They derive in some ways from convolutional neural nets.

I describe some of the key differences between DNNs and MLPs in the webinar. Also, the webinar explains how recent advances go far beyond just applications to speech recognition - in particular I focus on a case study in chemoinformatics.

>DNNs can be thought of a stacked Restricted Boltzmann Machines

Agree, as explained in Hinton et al 2006.

http://www.cs.toronto.edu/~hinton/absps/ncfast.pdf

But this is just for pre-training, as I said. If you look at Seide's paper, they pre-train treating the MLP as a DBN and then they train it as a classic MLP with BP. Also, layer-wise BP pre-training brings performance close to that of DBN pre-training, with no use of DBN paradigms at all.

>Their structure and training is very different to traditional MLPs

I insist: if we are talking about the same DNNs explained in Microsoft's paper, this is not true. If we are talking about different DNNs, please elaborate; I would love to hear about that (seriously, no irony here).

In comparison to older MLP research, besides the new training algorithm, there is this new insight that the deep structure of the network might be efficient for generating very good encodings of the input variables, as described here:

http://en.wikipedia.org/wiki/Autoencoder

I am not very familiar with speech recognition, but I think what they talk about here:

> Instead of factorizing the networks, e.g., into a monophone and a context-dependent part [5], or decomposing them hierarchically [6], CD-DNN-HMMs directly model tied context-dependent states (senones). This had long been considered ineffective, until [1] showed that it works and yields large error reductions for deep networks.

might be related to this fact. 20 years ago it wasn't known why you would pick a deep network instead of a shallow one. There was even this famous theorem of Kolmogorov, which a lot of people in ML misunderstood, that a network with just one hidden layer can in theory learn any function with arbitrary precision.

Again, the use of senones instead of monophones or diphones is just a change of the output targets; it is not a novelty per se.
The thing that NNs have in their favor that other "20 year old techniques" lack is their ability to model any mathematical equation. There is no fundamental limit to the complexity of systems NNs can model (as there is with other AI techniques).

The problem with NNs is the difficulty of training them. Back propagation with random initial weights is simple, but it can easily converge on a suboptimal local minimum if the learning rate is too aggressive. On the other hand, a slow learning rate requires an exponential increase in training time and data. Back propagation as a method was never really broken; it simply wasn't efficient enough to be effective in most situations. Deep belief techniques seem to remedy these inefficiencies in a significant way, while remaining a generalized solution.
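As a toy illustration of the training difficulty described above (not a neural network, just plain gradient descent on a one-dimensional curve), here is a minimal sketch. The function `loss`, the starting points and the step sizes are all assumptions picked so that the curve has one good minimum and one worse one.

```python
def loss(x):
    # Hypothetical non-convex curve: global minimum near x = -1.30,
    # shallower local minimum near x = 1.13.
    return x * x * x * x - 3.0 * x * x + x

def gradient(x):
    return 4.0 * x * x * x - 6.0 * x + 1.0

def descend(x, learning_rate, steps=500):
    """Plain gradient descent; returns None if the iterates blow up."""
    for _ in range(steps):
        x -= learning_rate * gradient(x)
        if abs(x) > 1e6:  # step size too aggressive for this curve
            return None
    return x

good = descend(-2.0, 0.01)     # small steps, lucky start: reaches the global minimum
stuck = descend(2.0, 0.01)     # small steps, unlucky start: settles in the worse minimum
diverged = descend(2.0, 0.2)   # large steps: overshoots further on every update
```

With small steps the result depends entirely on where the search starts, which is the role random initial weights play in back propagation; with large steps on this particular curve the search does not settle at all.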

Essentially, deep belief networks seem to optimize NNs to the point where new problems are now approachable, and they greatly improve performance on problems NNs could already solve. The complaint that "the core of the recipe itself is not very new" seems irrelevant in light of the results.

>The thing that NNs have in their favor that other "20 year old techniques" lack is their ability to model any mathematical equation. There is no fundamental limit to the complexity of systems NNs can model (as there is with other AI techniques).

I'm sure that a decision tree can also be viewed as a [universal approximator](http://en.wikipedia.org/wiki/Universal_approximation_theorem) if you let tree height go to infinity (just as you need to let layer size grow unbounded with a NN). In practice, this power is at best irrelevant and often actually a liability (you have to control model complexity to prevent overfitting/memorization).

And, importantly, being able to theoretically encode any function within your model is not the same as having a robust learning algorithm that will actually infer those particular weights from a sample of input/output data.
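A minimal sketch of the representational point above, assuming a one-dimensional input on [0, 1) and a balanced tree whose splits are fixed at interval midpoints. Nothing is learned here: `tree_predict` reads its leaf values straight off the target function, so this only shows that the error can be driven down by adding depth, not that any training algorithm would find such a tree from samples. The names and the sine target are hypothetical.

```python
import math

def tree_predict(target, depth, x):
    """Evaluate a balanced tree of midpoint splits on [0, 1).

    Each level halves the current interval; the leaf predicts the
    target value at the midpoint of its interval.
    """
    low, high = 0.0, 1.0
    for _ in range(depth):
        mid = (low + high) / 2.0
        if x < mid:
            high = mid
        else:
            low = mid
    return target((low + high) / 2.0)

def max_error(target, depth, samples=1000):
    """Largest absolute error over an evenly spaced grid on [0, 1)."""
    points = [i / samples for i in range(samples)]
    return max(abs(target(x) - tree_predict(target, depth, x)) for x in points)

def target(x):
    return math.sin(2.0 * math.pi * x)

# Worst-case error shrinks as the tree gets deeper (2**depth leaves).
errors = {depth: max_error(target, depth) for depth in (1, 3, 5, 7)}
```

The number of leaves doubles with every level, which is the tree-side analogue of letting the hidden layer of a NN grow without bound.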

Again, please have a look at Seide et al 2011 before commenting. Besides that, I am not complaining, just saying: wait a little longer before you claim the breakthrough of the decade.
>There is no fundamental limit to the complexity of systems NNs can model (as there is with other AI techniques).

Sure there is. For example, they will never solve the halting problem. They will also (probably) never solve NP-complete problems for very large instances.

While deep learning is a very cool technique and is currently getting the best results in a few domains, I think all the hype may become a problem. I was around for the prior round of neural network excitement and much time, effort and money was wasted. In that case it turned out that other techniques were more tractable and thus easier to use and improve upon.

It must be the association with the human brain that makes neural networks more exciting than other techniques. But despite the appeal of imitating nature, has this usually been the easiest way to make progress in the past? It seems like it would be harder to achieve both goals at the same time.

So far the results are looking pretty good, but it is probably best to keep the hype at a reasonable level unless it is crucial to your business model. ;)

I have a machine learning startup that is using deep learning neural networks, so I'm probably biased here. I really think there is something worth the hype here: this is the first time we can solve significant problems without lots of feature engineering just to make the problem solvable by the neural network. While I'm sure there are going to be tons of things that deep belief neural networks cannot do well even with these new capabilities and breakthroughs, there is a crapload of data out there that is begging to be analysed. Being able to get reasonable performance without a ton of feature engineering, and without a black arts team spending years building something that gets the data into a state where problems can be answered, is SUPER exciting. The neural networks we are using are more specialized and more like Yann LeCun's, and we aren't using dropout like Hinton, but we already have something that gets very good accuracy in our problem domain. There are some new techniques just coming out of Montreal; one in particular that I'm very excited about, called Maxout, looks like it will be another significant advance. One of the problems networks like this usually have is that the activation functions saturate above a certain level, and once a neuron is in the saturated state the gradient training process will barely move it anymore. Maxout is different in that it doesn't have this property, and it seems to maximize the benefit of the random selection process of dropout.
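A rough sketch of the saturation point, assuming a single scalar input for simplicity (a real maxout unit takes the max over several affine functions of a whole input vector). The slopes and offsets in `pieces` are hypothetical values, not anything from the Montreal work or from our own networks.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid; shrinks towards zero as |x| grows."""
    s = sigmoid(x)
    return s * (1.0 - s)

def maxout(x, pieces):
    """Maxout unit on a scalar input: max over linear pieces (slope, offset)."""
    return max(slope * x + offset for slope, offset in pieces)

def maxout_grad(x, pieces):
    """Gradient w.r.t. x is the slope of whichever piece is currently the max."""
    slope, _ = max(pieces, key=lambda p: p[0] * x + p[1])
    return slope

pieces = [(1.0, 0.0), (-0.5, 1.0)]  # hypothetical slopes and offsets
for x in (0.0, 5.0, 10.0):
    print(x, sigmoid_grad(x), maxout_grad(x, pieces))
```

For large inputs the sigmoid gradient becomes tiny, while the maxout gradient stays equal to the slope of the active piece, so the unit keeps receiving a usable training signal.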

While I don't have the math credentials to match Hinton I think as more 'normal' folks like me get into the game there will also be some interesting things going on. We are trying some interesting things that seem very promising, and I'm sure there are lots of other folks beginning to play with these things that will have some interesting ideas and approaches as well.

So I personally think this is super exciting, and while it might not be applicable to every problem, Deep Learning will definitely have a big impact.

>I was around for the prior round of neural network excitement and much time, effort and money was wasted. In that case it turned out that other techniques were more tractable and thus easier to use and improve upon.

And before 1980s-style neural networks there were 1950s perceptrons. That was a much bigger mess: it took more than ten years for someone to point out how 'dumb' perceptrons were (they couldn't even model an XOR), which led to a collapse in AI funding that lasted more than 25 years.
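For anyone who has not seen the XOR limitation first hand, here is a minimal sketch using the classic perceptron learning rule on two inputs. The function name, the epoch limit and the learning rate are assumptions for illustration.

```python
def train_perceptron(samples, epochs=100, learning_rate=1.0):
    """Classic perceptron rule; returns (weights, bias, converged)."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), label in samples:
            predicted = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            error = label - predicted
            if error != 0:
                # Nudge the separating line towards the misclassified point.
                w[0] += learning_rate * error * x1
                w[1] += learning_rate * error * x2
                b += learning_rate * error
                mistakes += 1
        if mistakes == 0:
            return w, b, True  # a full pass with no errors: data is separated
    return w, b, False

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_data = [(p, p[0] & p[1]) for p in inputs]
xor_data = [(p, p[0] ^ p[1]) for p in inputs]

and_converged = train_perceptron(and_data)[2]
xor_converged = train_perceptron(xor_data)[2]
```

AND is linearly separable, so the rule finds a separating line after a few passes. No single line separates the XOR labels, so no pass can ever finish without a mistake, however many epochs are allowed.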

Can we be a little more thoughtful this time and avoid the boom and bust cycle that so often leads to problems?

You would think that since it already happened with neural networks before, it would be less likely to happen again. However, it may be that the same factors that led to the last cycle are still in operation and it is actually more likely to happen again. Something like the reasons for the seemingly endless series of real estate bubbles.