| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by 99decisionstr 3939 days ago
	Not quite - it's different from the kernel trick, which is impossible at this scale (there's no way you can train an RBF kernel in a decent amount of time when your space has 10^8 features and your training set has 10^9 observations). The idea of factorization machine is to learn the polynomial kernel, and it provides a mechanism to do so which scales "linearly" (with some arbitrary constant that you're in control of).

2 comments

xtacy 3939 days ago

Oops, sorry, I didn't mean to say you used a RBF kernel. I edited my comment above. I meant to say your embedding resembles the polynomial kernel (not RBF, which was just meant as a generalisation of such neat tricks :)).

What I meant to say was that you didn't need to compute the embedding explicitly. Since you embed into a space that has a nice structure, you can compute the dot product of the embedded vectors without having to compute the embedding explicitly.

link

srean 3939 days ago

Could you elaborate on its differences from the kernel trick. It seems to me @xtacy is exactly right on that modeling point. Training the model using scale appropriate optimization is an orthogonal concern, if that is what you were getting at, i.e. an off the shelf kernelization would not be sufficient to run this at scale.

link

99decisionstr 3939 days ago

In very large-scale sparse settings, the optimization strategy is tightly coupled to the modeling, not orthogonal to it. The reason why is that "data beats algorithm": meaning that a "dumb" model that can be trained on 1TB of data using a "dumb" optimizer will easily outperform a fancier model for which we only have at our disposal more complex optimizers that can only eat 1GB of data in the same amount of time. This paper is a classic when it comes to the kind of trade-offs you have to deal with in large-scale web settings: http://papers.nips.cc/paper/3323-the-tradeoffs-of-large-scal...

The kernel trick uses an implicit mapping into a higher-dimensional feature-space. On the other hand, Factorization Machines uses an explicit mapping into the polynomial kernel space. However it learns jointly the "right" polynom, by mapping the base features into a low-dimensional dense space where the higher-order terms of the polynom are dot-products (we're not taking dot-products in the original feature-space!). FM then learns the right polynomial (a non-convex taks) jointly with learning the original supervised learning task.

link

srean 3939 days ago

Thanks for replying in spite of the fact that the audience would have moved to other threads by now. I am totally with you in the claim that to deploy a solution one has to be fully cognizant of its optimization theoretic implications. What I meant by orthogonal is the the properties of the model does not care what algorithm was used to optimize provided you reach an unique optimizer. Here it is tricky to make the same claim because although convex in a set of variables, its not jointly convex.

Where I am not with you is in the hint that kernels have feature maps that are necessarily opaque. In fact inhomogeneous poly kernels ar great examples where the feature map is known in closed form. Although such a map is available it is not always efficient to use it. But what it lets you do is optimize the weights of each dimension on tje mapped feature space but executed in the native space. In fact, and I am greatly tickled by the coincidence, I had sent a mail to colleagues where I suggested a form where instead of the standard Euclidean dot product in the poly kernel is replaced by a convex combination of other favourite kernel of choice. The training algo remains almost the same. I was not aware of this piece of work but the mechanism is pretty much the same (little more general. Essentially replaces the std dot product that appear in the kernel expression by other kernel expressions. If you are thinking recursion now well that's intended) and can be pushed further.

link