| I'll take a stab at it (sorry if I'm not getting to the heart of your question and wrote about things you're already familiar with). I think there are two key levels of understanding about the kernel trick (the observation that often you only ever use a "kernel function" k(x,x') which roughly measures similarity between samples, rather than all the features of the samples themselves). > "... if there was actually a simple transformation, I would just transform it in the first place and use a linear model" The first level of usefulness of the kernel trick is that it allows us to bypass this. Even when we know the transformation, bypassing the explicit transformation can be vastly more computationally efficient. Say our data comes as x = (x_1, ..., x_n) but we suspect that better features would be the the second degree terms: T(x) = (x_i x_j)_{1 <= i <= j <= n}. So T maps R^n to R^{n^2}. So our data matrix could go from 10^3 wide to 10^6 wide (terrible!). And then we want to compute the inner products of two samples mapped into this higher dimensional space R^{n^2}, which will be an O(n^2) operation. Alternatively, if we focus in on the fact that we really only need the inner product of the transformed samples (not actually the transformed samples themselves), we see that what we want is: Sum_{1<=i<=j<=n} (x_i x_j) (x'_i x'_j) = [Sum_{1<= i <= n} x_i x'_i ]^2 = <x,x'>^2 where <x,x'> is the normal inner product in R^n. So we can define k(x,x') = <x,x'>^2, and computing k(x,x') like this is only an O(n) operation (whereas not using the kernel trick and going the explicit transformation route leads to O(n^2) operations for computing inner products in the higher dimensional space). So we've seen that, even when the transformation to be applied is simple and known, avoiding it with the kernel trick can vastly improve the speed and memory usage of the model. The second level of understanding the kernel trick is observing that a kernel is simply measuring similarity between two samples in some way. We can conjure kernel functions that create a notion of similarity that we want to try out (or suspect would be good for our data), without ever having to think about what kind of transformation of the data would lead to an inner product in a higher dimension space that leads to that similarity. Let's make one right now. Say we want two samples x and x' to be similar if they are close (in R^n) and not similar if they are not close, but we really want to exaggerate this. We may imagine there's some threshold (that if two samples are 1 unit away from each other, that's quite similar, but being 3 units away isn't 1/3rd as similar but far far less similar) we really want to "peak" similarity in a tight radius. Then we could use k(x,x') = exp(- |x-x'|^2), since this only has a value near 1 if x and x' are quite close and drops off rapidly to 0 as x and x' get further apart. How rapidly should the similarity drop off as they get further apart? That's probably a parameter we may want to experiment with, let's go with k(x,x') = exp(- gamma * |x-x'|^2) instead. We've just invented Radial Basis Function (RBF) kernels (or Gaussian kernels) ! Do we have any idea what explicit transformation we would do to our data to get an inner product in a higher dimensional space that leads to this same function k(x,x')? Nope. Regardless, do we have a notion of similarity that may be very useful for our data? Yup. So the kernel trick transforms the harder problem of thinking up a transformation to a higher dimensional space where the data can be easily separated, into the easier problem of thinking up good notions of similarity between samples. But you're right - you still need to have some type of understanding of your data to intuit what a good kernel function will be for your problem. That's part of the art (unfortunately, less of a science) of being good at training SVMs. If you have no idea at all, most people will go with a Gaussian kernel and just see how that goes. Knowing all the common kernels and when to use which is basically the SVM equivalent of hyperparameter tuning in NNs - the model doesn't learn itself which ones are good, despite that there are some common-wisdom good defaults, and you can squeeze out some extra performance by knowing how to select the good ones from experience (or brute force searching all options). I need some practice in being more concise, but hopefully some of this helps. |