|
|
|
|
|
by cultus
2732 days ago
|
|
I notice this doesn't mention hinge loss, which is by far the simpler way of arriving at the SVM. Hinge loss is just max(0, 1- t*y), where y is the output of the linear model and t = +-1 is the label. Thus, it takes the common-sense approach of not penalizing losses that are far enough away from the decision boundary, and penalizing linearly after that. An SVM is literally just a linear model with hinge loss instead of log loss (logistic regression) or squared loss (ordinary linear regression) in primal form. For apparently historical reasons, it is usually derived from the "hard-margin" SVM in dual form, motivating with trying to maximize the margin. This is complicated and not very intuitive. This also causes people to conflate the kernel trick and the dual form, while in fact they have nothing to do with each other. You can use the kernel trick in primal svm just fine. Stochastic gradient descent can also be used for primal methods, while it doesn't work in the dual. That makes it much faster for large problems than the dual. |
|
Once you realize that a linear SVM isn’t very different from logistic regression, it starts to all make sense (at least it did for me).
Key insight of the hinge-loss: once something is classified correctly beyond the margin, it incurs a loss of zero.
Now, Something fun to think about. Draw the hinge loss. Now draw the ReLU (which is found all over the place in CNNs). Now thing about L1-regularization (which was used to induce sparsity in compressed sensing). They are more similar in form than you would think.