Hacker News new | ask | show | jobs
by olooney 2720 days ago
People who know more about deep learning than I do tend to argue that there is empirical evidence that non-convexity is a non-issue because the performance of local minima will be close to the performance of the global minimum, given a sufficient number of nodes[1][2]. One such quote:

    I once ran a small neural net 100 times on simple three-dimensional data re- 
    selecting the initial weights to be small and random on each run. I found 32 
    distinct minima, each of which gave a different picture, and having about equal 
    test set error.

    – Leo Breiman
[1]: https://projecteuclid.org/download/pdf_1/euclid.ss/100921372...

[2]: https://stats.stackexchange.com/questions/203288/understandi...

Note also that "optimal" in each case means "relative to other models in the same family." But a large neural net and an SVM do not exist in the same hypothesis space. As an analogy, recall that Ordinary Least Squares is the BLUE (Best Linear Unbiased Estimator.) But that's only relative to other unbiased (which in particular means no regularization: not ridge/lasso/elasticnet) linear models over the same feature set. Just because OLS provably gives the best performance in this family doesn't mean it is going to do well compared to models outside that family. In the same way, just because SVM offers a convergence guarantee doesn't mean its performance will always be the same or better than an NN.

The real problem with SVMs is that when you use a kernel (which is the whole point, linear SVMs are just logistic regression with hinge loss) then you introduce one landmark (feature) for every data point. So if your original data set was a manageable 1e6x1e3 then the SVM will view this data set as 1e6x1e6. It doesn't actually have to store that in memory thanks to the kernel trick[3] but training time still scales as O(N^2) where N is the number of observations (rows) in the original data. In practice, even with a highly optimized library like LIBSVM[4] you're not going to get good performance past N=1e6. (I've personally had the most success with SVMs on small datasets of only a few thousand rows.) While NN can very easily accommodate much later training sets with training time only growing as O(N), more or less. You can always sample a manageable training set from a larger data set but if your only using 1% of your data to train a model that's a problem.

Deep NN is also a far more modular approach: if you know your data has a 2D structure for example, you can add some CNN layers and get translational invariance. Deep NN comes with a large toolbox for specializing the model to the data, while the options for tuning an SVM are much more limited: you can change out the kernel function (not that anything is likely to beat RBF) and play around with the regularization parameter, and that's it. Once you've tuned those hyper-parameters with a small grid search there's not much else left to try with SVMs. If they work, they work, otherwise you have to change approaches.

[3]: https://en.wikipedia.org/wiki/Kernel_method [4]: https://www.csie.ntu.edu.tw/~cjlin/libsvm/

2 comments

I think you're exactly right about the modularity aspect of DL; in fact I made a similar comment on this page, albeit speaking in terms of basis functions.

I have a minor nitpick regarding this point you make: not that anything is likely to beat RBF. Depending on the data, specialized kernels can help immensely. An easy example is sequence classification where something like a string kernel might work really well. Or image classification, where histogram based kernels might prove superior.

Note that sometimes you might want to measure how good a kernel is for a problem not by its prediction accuracy alone but also by the number of support vectors it needs - if the final model retains ~100% of the training data as support vectors, it is not a great model in some (subjective) sense since it is memorizing a lot. Depending on the data, you might "beat" the RBF kernel on this aspect too.

Regarding the training time, there are some interesting tricks I've come across (but not tried them out yet) -[1], [2].

[1] Ensemble SVMs http://www.jmlr.org/papers/volume15/claesen14a/claesen14a.pd...

[2] SVMPath - algorithm to fit the entire path of SVM solutions for every value of the cost parameter, with essentially the same computational cost as fitting one SVM model. http://www.jmlr.org/papers/volume5/hastie04a/hastie04a.pdf

Thanks for your point about BLUE and regularized solutions. I had not considered that point of view before