Hacker News new | ask | show | jobs
by cultus 2732 days ago
I notice this doesn't mention hinge loss, which is by far the simpler way of arriving at the SVM. Hinge loss is just max(0, 1- t*y), where y is the output of the linear model and t = +-1 is the label. Thus, it takes the common-sense approach of not penalizing losses that are far enough away from the decision boundary, and penalizing linearly after that.

An SVM is literally just a linear model with hinge loss instead of log loss (logistic regression) or squared loss (ordinary linear regression) in primal form. For apparently historical reasons, it is usually derived from the "hard-margin" SVM in dual form, motivating with trying to maximize the margin. This is complicated and not very intuitive.

This also causes people to conflate the kernel trick and the dual form, while in fact they have nothing to do with each other. You can use the kernel trick in primal svm just fine.

Stochastic gradient descent can also be used for primal methods, while it doesn't work in the dual. That makes it much faster for large problems than the dual.

2 comments

The hinge-loss and the primal form of the SVM objective is really easy to understand. Every ML 101 class would jump into the dual formulation, talk about kernels, RKHS, and all the fancy stuff.

Once you realize that a linear SVM isn’t very different from logistic regression, it starts to all make sense (at least it did for me).

Key insight of the hinge-loss: once something is classified correctly beyond the margin, it incurs a loss of zero.

Now, Something fun to think about. Draw the hinge loss. Now draw the ReLU (which is found all over the place in CNNs). Now thing about L1-regularization (which was used to induce sparsity in compressed sensing). They are more similar in form than you would think.

It's funny how close everything is connected. It turns out you can even derive the hinge loss using a mixture of normal distributions, so it is also connected closely to OLS.

Some people have had good luck with hinge or multi-hinge loss for neural networks instead of the almost universal log loss, since of course the hinge loss can be used in things other than linear models. It doesn't care how you get the y output.

Wasn't the reason why everyone cared about SVMs was specifically because of the nonlinear kernel stuff that would push performance a smidge and produce new bests on existing benchmark datasets?

The QP-dual formulation never seemed like something that could scale, and linear SVMs never seemed all that much better than just lasso/elasticnet regression. (hmmmmm :) )

The nonlinear kernel SVM works at least as well in primal, using just the representer theorem. Since it is unconstrained, all you need to do is create a kernel matrix and solve your system with your favorite convex optimizer like Newton's method (which also can work in lower dimensions).

Second-order methods like Newton's method converge better to the exact solution that SGD, although they don't reach a "pretty good" solution as fast usually. Coordinate descent methods in the dual also get very close to the exact solution, but Newton's method and friends are usually faster. With (quasi-) Newton methods in the primal, everything just comes down to solving linear systems, which is a much more well-studied problem.

I've even experimented successfully with kernelized models with millions of examples in low dimensions using the Fast Gauss Transform. That's impossible in the dual.

You can also generate low-rank kernel approximations [0] using the Nystroem method or the Fastfood transform that can then be used in a linear SVM. For example, if I had a problem with n=10^6, I can make a low-rank approximation of the kernel matrix (say d=1000) and feed that into a fast SGD optimizer.

This often works really well, and is usually pretty close to the exact kernel solution if the problem is of lower intrinsic dimensionality, which is usually true if the dual SVM is sparse in its basis vectors. This largely negates the sparsity advantage of the primal SVM. If a kernel approximation isn't good, then the dual SVM wouldn't be meaningfully sparse anyway, so there is still no advantage of dual. Best just solve the kernelized system in the primal, and use a second-order optimizer if needed.

[0] https://scikit-learn.org/stable/modules/kernel_approximation...

Its similar to L2regularized logistic regression not the pure logistic regression. Although in your defense pure logistic is not used that often. I find the large margin view very intuitive -- one finds a separator such that the labeled data points are separated and lie as far as possible from the line
Logistic regression is usually assumed to be regularized in machine learning. It's not often you find practical problems that don't need regularization.
Indeed. You would notice that I had said as much. For the analogy between SVMs and LR to work you need L2 squared regression specifically, other regularizers, for example L1 wont work for the analogy. L2 squared is the key for the connection with Hilbert space and kernels to fall out automagically