| HN Mirror

The nonlinear kernel SVM works at least as well in primal, using just the representer theorem. Since it is unconstrained, all you need to do is create a kernel matrix and solve your system with your favorite convex optimizer like Newton's method (which also can work in lower dimensions).

Second-order methods like Newton's method converge better to the exact solution that SGD, although they don't reach a "pretty good" solution as fast usually. Coordinate descent methods in the dual also get very close to the exact solution, but Newton's method and friends are usually faster. With (quasi-) Newton methods in the primal, everything just comes down to solving linear systems, which is a much more well-studied problem.

I've even experimented successfully with kernelized models with millions of examples in low dimensions using the Fast Gauss Transform. That's impossible in the dual.

You can also generate low-rank kernel approximations [0] using the Nystroem method or the Fastfood transform that can then be used in a linear SVM. For example, if I had a problem with n=10^6, I can make a low-rank approximation of the kernel matrix (say d=1000) and feed that into a fast SGD optimizer.

This often works really well, and is usually pretty close to the exact kernel solution if the problem is of lower intrinsic dimensionality, which is usually true if the dual SVM is sparse in its basis vectors. This largely negates the sparsity advantage of the primal SVM. If a kernel approximation isn't good, then the dual SVM wouldn't be meaningfully sparse anyway, so there is still no advantage of dual. Best just solve the kernelized system in the primal, and use a second-order optimizer if needed.

[0] https://scikit-learn.org/stable/modules/kernel_approximation...