Hacker News new | ask | show | jobs
by quantombone 2720 days ago
The hinge-loss and the primal form of the SVM objective is really easy to understand. Every ML 101 class would jump into the dual formulation, talk about kernels, RKHS, and all the fancy stuff.

Once you realize that a linear SVM isn’t very different from logistic regression, it starts to all make sense (at least it did for me).

Key insight of the hinge-loss: once something is classified correctly beyond the margin, it incurs a loss of zero.

Now, Something fun to think about. Draw the hinge loss. Now draw the ReLU (which is found all over the place in CNNs). Now thing about L1-regularization (which was used to induce sparsity in compressed sensing). They are more similar in form than you would think.

2 comments

It's funny how close everything is connected. It turns out you can even derive the hinge loss using a mixture of normal distributions, so it is also connected closely to OLS.

Some people have had good luck with hinge or multi-hinge loss for neural networks instead of the almost universal log loss, since of course the hinge loss can be used in things other than linear models. It doesn't care how you get the y output.

Wasn't the reason why everyone cared about SVMs was specifically because of the nonlinear kernel stuff that would push performance a smidge and produce new bests on existing benchmark datasets?

The QP-dual formulation never seemed like something that could scale, and linear SVMs never seemed all that much better than just lasso/elasticnet regression. (hmmmmm :) )

The nonlinear kernel SVM works at least as well in primal, using just the representer theorem. Since it is unconstrained, all you need to do is create a kernel matrix and solve your system with your favorite convex optimizer like Newton's method (which also can work in lower dimensions).

Second-order methods like Newton's method converge better to the exact solution that SGD, although they don't reach a "pretty good" solution as fast usually. Coordinate descent methods in the dual also get very close to the exact solution, but Newton's method and friends are usually faster. With (quasi-) Newton methods in the primal, everything just comes down to solving linear systems, which is a much more well-studied problem.

I've even experimented successfully with kernelized models with millions of examples in low dimensions using the Fast Gauss Transform. That's impossible in the dual.

You can also generate low-rank kernel approximations [0] using the Nystroem method or the Fastfood transform that can then be used in a linear SVM. For example, if I had a problem with n=10^6, I can make a low-rank approximation of the kernel matrix (say d=1000) and feed that into a fast SGD optimizer.

This often works really well, and is usually pretty close to the exact kernel solution if the problem is of lower intrinsic dimensionality, which is usually true if the dual SVM is sparse in its basis vectors. This largely negates the sparsity advantage of the primal SVM. If a kernel approximation isn't good, then the dual SVM wouldn't be meaningfully sparse anyway, so there is still no advantage of dual. Best just solve the kernelized system in the primal, and use a second-order optimizer if needed.

[0] https://scikit-learn.org/stable/modules/kernel_approximation...