| HN Mirror



	by bippihippi1 1086 days ago
	it's been proven that all models learned by gradient descent are equivalent to kernel machines. interpolation isn't generalization. if there's a new input sufficiently different from the training data, the behaviour is unknown

3 comments

drdeca 1085 days ago

Can you say what that says about the behavior described with the modular arithmetic in the article?

And, in particular, how to interpret the fact that different hyperparameters determined whether runs, obtaining equally high accuracy on the training data, got good or bad scores on the test data, in terms of the "view it as a kernel machine/interpolation" lens?

My understanding is that the result in at least one of those "models learned by gradient descent are equivalent to [some other model]" papers works by constructing something based on the entire training history of the network. Is that the kernel machines one, or some other one?


bippihippi1 1085 days ago

if you train a model on modular arithmetic, it can only learn what's in the training data. if all of the examples are of the form a + b mod 10, it isn't likely to generalize to solving a + b mod 12. a human can learn the rule and figure it out. a model can't. that's why a diverse training set is so important. it's possible to train a model to approximate any function, but the approximation is not reliably accurate outside of the datapoints you trained on, as far as I understand.

different hyperparameters can give a model that is over- or underfit, but tuning them helps the model interpolate, not generalize. it can know the answers for inputs similar to the training data, but not for inputs that differ from it
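
A minimal sketch of the interpolation-versus-extrapolation point, assuming an RBF kernel ridge regressor as a stand-in for "a kernel machine". The kernel, length scale, ridge term, sine target, and all function names here are illustrative choices and not anything from the article or the papers mentioned above. It fits points in [-3, 3], then compares the worst-case error inside that range with the error far outside it.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    # Pairwise Gaussian (RBF) kernel between two 1-D arrays of inputs.
    diff = a[:, None] - b[None, :]
    return np.exp(-diff**2 / (2.0 * length_scale**2))

def fit_kernel_ridge(x_train, y_train, ridge=1e-6):
    # Solve (K + ridge * I) alpha = y for the dual coefficients alpha.
    K = rbf_kernel(x_train, x_train)
    return np.linalg.solve(K + ridge * np.eye(len(x_train)), y_train)

def predict(x_train, alpha, x_new):
    # Prediction is a weighted sum of kernel similarities to training points.
    return rbf_kernel(x_new, x_train) @ alpha

# Training data: sin(x) sampled only on [-3, 3] (an assumed toy target).
x_train = np.linspace(-3.0, 3.0, 40)
alpha = fit_kernel_ridge(x_train, np.sin(x_train))

x_inside = np.linspace(-2.5, 2.5, 50)    # between training points
x_outside = np.linspace(10.0, 14.0, 50)  # far from every training point

interp_err = np.max(np.abs(predict(x_train, alpha, x_inside) - np.sin(x_inside)))
extrap_err = np.max(np.abs(predict(x_train, alpha, x_outside) - np.sin(x_outside)))

print(f"max error inside training range:  {interp_err:.2e}")
print(f"max error outside training range: {extrap_err:.2e}")
```

In this toy setup the prediction far from the data collapses toward zero because every kernel similarity is near zero there, so the model says nothing useful about the target in that region. Whether a network trained by gradient descent behaves the same way is the question the thread is arguing about; the sketch only shows what the "kernel machine" picture predicts.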


xapata 1086 days ago

One weird trick ...

There's some fox and hedgehog analogy I've never understood.


visarga 1086 days ago

but when the model trains on 13T tokens it is hard to be OOD
