Hacker News new | ask | show | jobs
by iamwil 2721 days ago
I have a question!

In the pdf, it said that the optimization problem in SVMs have a nice property in that it was quadratic, which means that there's a nice global minimum to go towards, and not lots of local minimum like in NN. That means, it seems SVMs won't get stuck at a suboptimal solution.

Is that not a problem in DNNs now? Or is it that it's such high dimensionality that local minima don't stop the optimizer, because there's always another way around the local minimum?

6 comments

Still a problem, but not a problem at the same time. Downhill in the loss function is always better, so you may not be at the global minima but you might be at a good enough spot anyway. Using SGD gives you a bit of local minima hopping ability as well. Interesting question to think about what the loss surface looks like, as it is so high dimensional that it might just be saddles everywhere. The difference between a local minima and the global might be practically nothing in terms of classifier performance. The convexity of the loss function for SVMs is a mixed blessing, you are guaranteed to be at the global min but the optimization problem doesn't scale with large training data or big feature vectors. Hence the historical feature engineering efforts or sacrificing this property to a more scalable optimization method. So in short the ability to use high dimensional features in a NN means you don't get any guarantee about if the minima you get will be the best, but you lose less by not having to cut down the size of the input vector by hand (i.e. working directly with an image rather than some embedding with keypoint descriptors etc).
As others have said, you don't actually want the global optimum of a neural network because that would be terrible overfitting. There is some evidence that architectural tricks (like ResNet) that empirically help performance are making the loss landscape "more convex", though.

https://arxiv.org/pdf/1712.09913.pdf

SVMs are a convex optimization problem, which means any local optimum is also globally optimal. In practice that leads to fast, provably optimal solutions.

With deep learning, the problem is non-convex, so the results are locally optimal but not necessarily globally optimal. We get around this with different techniques such as randomized initial weights, but with few exceptions, non-convex problems are NP-complete. Therefore there is no polynomial-time algorithm for finding the global optimum in deep learning unless P==NP.

I occasionally hear people claim deep learning finds the globally optimal solution, but that’s just not correct (or they mean they are doing an exhaustive search e.g. branch and bound, which has exponential worst case run time and isn’t practical).

That said, locally optimal solutions are often “good enough” for practical problems.

Its a small mercy that one does not find the global minimum of DNNs. Given their number of parameters and no explicit regularization term, they would overfit viciously. The inability to reach the global and the jiggling around imposed by stochastic gradient descent acts as implicit regularization
SVMs have convex objective functions, but when people use SVMs, they are using some kind of features + SVMs on top. The success of the approach is both good features, and waiting long enough for the optimizer to converge.

With DNNs, people learn everything (features + decision boundary), and this problem is not convex. Surprisingly DNNs work quite well in practice even though we were taught to be afraid of non-convex problems in grad school around 2005.

If back in early 2000s, we stopped worrying about theoretical issues and explored more approaches like ConvNets, we might have had the deep learning revolution 10 years earlier.

You usually do not even want a global optimum with a high capacity model such as a DNN anyway (SVMs memorize only a small number of data points), because that possibly means that you are overfitting to your training dataset.
People who know more about deep learning than I do tend to argue that there is empirical evidence that non-convexity is a non-issue because the performance of local minima will be close to the performance of the global minimum, given a sufficient number of nodes[1][2]. One such quote:

    I once ran a small neural net 100 times on simple three-dimensional data re- 
    selecting the initial weights to be small and random on each run. I found 32 
    distinct minima, each of which gave a different picture, and having about equal 
    test set error.

    – Leo Breiman
[1]: https://projecteuclid.org/download/pdf_1/euclid.ss/100921372...

[2]: https://stats.stackexchange.com/questions/203288/understandi...

Note also that "optimal" in each case means "relative to other models in the same family." But a large neural net and an SVM do not exist in the same hypothesis space. As an analogy, recall that Ordinary Least Squares is the BLUE (Best Linear Unbiased Estimator.) But that's only relative to other unbiased (which in particular means no regularization: not ridge/lasso/elasticnet) linear models over the same feature set. Just because OLS provably gives the best performance in this family doesn't mean it is going to do well compared to models outside that family. In the same way, just because SVM offers a convergence guarantee doesn't mean its performance will always be the same or better than an NN.

The real problem with SVMs is that when you use a kernel (which is the whole point, linear SVMs are just logistic regression with hinge loss) then you introduce one landmark (feature) for every data point. So if your original data set was a manageable 1e6x1e3 then the SVM will view this data set as 1e6x1e6. It doesn't actually have to store that in memory thanks to the kernel trick[3] but training time still scales as O(N^2) where N is the number of observations (rows) in the original data. In practice, even with a highly optimized library like LIBSVM[4] you're not going to get good performance past N=1e6. (I've personally had the most success with SVMs on small datasets of only a few thousand rows.) While NN can very easily accommodate much later training sets with training time only growing as O(N), more or less. You can always sample a manageable training set from a larger data set but if your only using 1% of your data to train a model that's a problem.

Deep NN is also a far more modular approach: if you know your data has a 2D structure for example, you can add some CNN layers and get translational invariance. Deep NN comes with a large toolbox for specializing the model to the data, while the options for tuning an SVM are much more limited: you can change out the kernel function (not that anything is likely to beat RBF) and play around with the regularization parameter, and that's it. Once you've tuned those hyper-parameters with a small grid search there's not much else left to try with SVMs. If they work, they work, otherwise you have to change approaches.

[3]: https://en.wikipedia.org/wiki/Kernel_method [4]: https://www.csie.ntu.edu.tw/~cjlin/libsvm/

I think you're exactly right about the modularity aspect of DL; in fact I made a similar comment on this page, albeit speaking in terms of basis functions.

I have a minor nitpick regarding this point you make: not that anything is likely to beat RBF. Depending on the data, specialized kernels can help immensely. An easy example is sequence classification where something like a string kernel might work really well. Or image classification, where histogram based kernels might prove superior.

Note that sometimes you might want to measure how good a kernel is for a problem not by its prediction accuracy alone but also by the number of support vectors it needs - if the final model retains ~100% of the training data as support vectors, it is not a great model in some (subjective) sense since it is memorizing a lot. Depending on the data, you might "beat" the RBF kernel on this aspect too.

Regarding the training time, there are some interesting tricks I've come across (but not tried them out yet) -[1], [2].

[1] Ensemble SVMs http://www.jmlr.org/papers/volume15/claesen14a/claesen14a.pd...

[2] SVMPath - algorithm to fit the entire path of SVM solutions for every value of the cost parameter, with essentially the same computational cost as fitting one SVM model. http://www.jmlr.org/papers/volume5/hastie04a/hastie04a.pdf

Thanks for your point about BLUE and regularized solutions. I had not considered that point of view before