|
|
|
|
|
by a-dub
2722 days ago
|
|
Wasn't the reason why everyone cared about SVMs was specifically because of the nonlinear kernel stuff that would push performance a smidge and produce new bests on existing benchmark datasets? The QP-dual formulation never seemed like something that could scale, and linear SVMs never seemed all that much better than just lasso/elasticnet regression. (hmmmmm :) ) |
|
Second-order methods like Newton's method converge better to the exact solution that SGD, although they don't reach a "pretty good" solution as fast usually. Coordinate descent methods in the dual also get very close to the exact solution, but Newton's method and friends are usually faster. With (quasi-) Newton methods in the primal, everything just comes down to solving linear systems, which is a much more well-studied problem.
I've even experimented successfully with kernelized models with millions of examples in low dimensions using the Fast Gauss Transform. That's impossible in the dual.
You can also generate low-rank kernel approximations [0] using the Nystroem method or the Fastfood transform that can then be used in a linear SVM. For example, if I had a problem with n=10^6, I can make a low-rank approximation of the kernel matrix (say d=1000) and feed that into a fast SGD optimizer.
This often works really well, and is usually pretty close to the exact kernel solution if the problem is of lower intrinsic dimensionality, which is usually true if the dual SVM is sparse in its basis vectors. This largely negates the sparsity advantage of the primal SVM. If a kernel approximation isn't good, then the dual SVM wouldn't be meaningfully sparse anyway, so there is still no advantage of dual. Best just solve the kernelized system in the primal, and use a second-order optimizer if needed.
[0] https://scikit-learn.org/stable/modules/kernel_approximation...