by mjburgess 1084 days ago

Statistical learning can typically be phrased in terms of k nearest neighbours

In the case of NNs we have a "modal knn" (memorising) going to a "mean knn" ('generalising') under the right sort of training.

I'd call both of these memorising, but the latter is a kind of weighted recall.

Generalisation as a property of statistical models (i.e., models of conditional freqs) is not the same property as generalisation in the case of scientific models.

In the latter a scientific model is general because it models causally necessary effects from causes -- so, necessarily if X then Y.

Whereas generalisation in associative stats is just about whether you're drawing data from the empirical freq. distribution or whether you've modelled first. In all automated stats the only diff between the "model" and "the data" is some sort of weighted averaging operation.

So in automated stats (i.e., ML, AI) it's really just whether the model uses a mean.
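A minimal sketch of the "modal knn" versus "mean knn" distinction described above, assuming 1-D inputs and a toy dataset. The function name `knn_predict` and the data are hypothetical, chosen only for illustration. Both modes build their output purely from stored training labels; the mean mode just averages them.

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k, mode):
    """Predict for one query from its k nearest training points.

    mode="modal": return the most common neighbour label (plain recall).
    mode="mean":  return the average neighbour label (weighted recall).
    """
    dists = np.abs(x_train - x_query)
    nearest = np.argsort(dists, kind="stable")[:k]
    labels = y_train[nearest]
    if mode == "modal":
        values, counts = np.unique(labels, return_counts=True)
        return values[np.argmax(counts)]
    return labels.mean()

# Toy data (assumed): y = x**2 sampled at four points.
x_train = np.array([0.0, 1.0, 2.0, 3.0])
y_train = np.array([0.0, 1.0, 4.0, 9.0])

modal = knn_predict(x_train, y_train, 1.4, k=1, mode="modal")  # label of the closest point
mean = knn_predict(x_train, y_train, 1.4, k=2, mode="mean")    # average of the two closest labels
```

With these numbers the modal prediction returns a stored label unchanged, while the mean prediction returns a value that appears nowhere in the training set but is still just an average of stored labels.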

4 comments

autokad 1084 days ago

I disagree, it feels like you are just fussing over words and not what's happening in the real world. If you were right, a human doesn't learn anything either, they just memorize.

you can look at it by results: I give these models inputs they've never seen before, but they give me outputs that are correct / acceptable.

you can look at it in terms of data: we took petabytes of data, and with an 8 GB model (Stable Diffusion) we can output an image of anything. That's an unheard-of compression ratio, only possible if it's generalizing, not memorizing.

ActivePattern 1084 days ago

I'd be curious how much of the link you read.

What they demonstrate is a neural network learning an algorithm that approximates modular addition. The exact workings of this algorithm are explained in the footnotes. The learned algorithm is general -- it is just as valid on unseen inputs as seen inputs.

There's no memorization going on in this case. It's actually approximating the process used to generate the data, which just isn't possible using k nearest neighbors.
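One way to probe the nearest-neighbour claim is a small sketch like the following. It assumes one-hot encoded inputs, a small modulus `P = 7`, and a held-out set made of the diagonal pairs `(a, a)`; none of these choices come from the linked article, and a different input encoding could behave differently.

```python
import numpy as np

P = 7  # hypothetical small modulus, chosen for illustration

def one_hot_pair(a, b):
    """Encode (a, b) as two concatenated one-hot vectors of length P."""
    v = np.zeros(2 * P)
    v[a] = 1.0
    v[P + b] = 1.0
    return v

# Hold out the diagonal pairs (a, a); train on every other pair.
pairs = [(a, b) for a in range(P) for b in range(P)]
train = [(a, b) for a, b in pairs if a != b]
test = [(a, b) for a, b in pairs if a == b]

x_train = np.array([one_hot_pair(a, b) for a, b in train])
y_train = np.array([(a + b) % P for a, b in train])

def nearest_neighbour(a, b):
    """Return the label of the closest stored training pair."""
    dists = np.linalg.norm(x_train - one_hot_pair(a, b), axis=1)
    return int(y_train[np.argmin(dists)])

# Fraction of held-out pairs where 1-NN recovers (a + b) mod P.
knn_accuracy = np.mean([nearest_neighbour(a, b) == (a + b) % P for a, b in test])
```

Under this encoding the closest training pair always shares exactly one operand with the query, so its label necessarily differs from the correct sum and the held-out accuracy is zero. The rule `(a + b) % P` itself is, by construction, correct on every pair.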

visarga 1084 days ago

> Statistical learning can typically be phrased in terms of k nearest neighbours

We have suspected that neural nets are a kind of kNN. Here's a paper:

Every Model Learned by Gradient Descent Is Approximately a Kernel Machine

https://arxiv.org/abs/2012.00152

bippihippi1 1084 days ago

it's been proven that all models learned by gradient descent are approximately equivalent to kernel machines. interpolation isn't generalization. if there's a new input sufficiently different from the training data, the behaviour is unknown
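For readers unfamiliar with the term, a kernel machine predicts with a weighted sum of similarities to the training points. The sketch below is a generic example using a Gaussian (RBF) kernel and a ridge-regularised fit; it is not the path kernel from the paper, and the data, kernel width, and ridge value are assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(a, b, width=1.0):
    """Gaussian similarity between every point in a and every point in b."""
    diff = a[:, None] - b[None, :]
    return np.exp(-(diff ** 2) / (2 * width ** 2))

# Toy data (assumed): y = x**2 sampled on [0, 3].
x_train = np.linspace(0.0, 3.0, 7)
y_train = x_train ** 2

# Fit coefficients c_i so that sum_i c_i * K(x, x_i) approximates the labels.
ridge = 1e-3
gram = rbf_kernel(x_train, x_train)
coeffs = np.linalg.solve(gram + ridge * np.eye(len(x_train)), y_train)

def predict(x_query):
    """Kernel machine prediction: weighted sum of similarities to training points."""
    return rbf_kernel(np.atleast_1d(x_query), x_train) @ coeffs

inside = predict(1.25)[0]    # query between training points
outside = predict(10.0)[0]   # query far from every training point
```

Far from the data every similarity is close to zero, so the prediction collapses toward zero even though the toy target there would be 100. That collapse is a property of this particular kernel, which is one concrete way behaviour away from the training data can be set by the model rather than by the target function.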

drdeca 1084 days ago

Can you say what that says about the behavior described with the modular arithmetic in the article?

And, in particular, how to interpret the fact that different hyperparameters determined whether runs, obtaining equally high accuracy on the training data, got good or bad scores on the test data, in terms of the "view it as a kernel machine/interpolation" lens?

My understanding is that at least one of those "models learned by gradient descent are equivalent to [some other model]" papers works by constructing something which is based on the entire training history of the network. Is that the kernel machines one, or some other one?

bippihippi1 1083 days ago

if you train a model on modular arithmetic, it can only learn what's in the training data. if all of the examples are of the form a + b mod 10, it isn't likely to generalize to be able to solve a + b mod 12. a human can learn the rule and figure it out. a model can't. that's why a diverse training set is so important. it's possible to train a model to approximate any function, but whether the approximation is accurate outside of the datapoints you trained on is not reliable, as far as I understand.

different hyperparameters can give a model that is over- or underfit, but this helps the model interpolate, not generalize. it can know all the answers similar to the training data, not answers that differ from it

xapata 1084 days ago

One weird trick ...

There's some fox and hedgehog analogy I've never understood.

visarga 1084 days ago

but when the model trains on 13T tokens it is hard to be OOD
