by 988747 1729 days ago
> In practice, these huge models are, in layman's terms, fucking awesome and work really well, e.g. they generalize and work in production. No one understands why.

How about the resulting weights? If most of them are close to 0, that would mean part of the training consists of the NN learning which of the 1.5B parameters are relevant and which are not.
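For what it's worth, the check being asked about is easy to sketch. The snippet below is only an illustration: the `near_zero_fraction` helper, the `1e-3` threshold, and the synthetic `weights` list are all assumptions standing in for a real model's flattened parameters, and a small weight is not necessarily an irrelevant one.

```python
import random

def near_zero_fraction(weights, tol=1e-3):
    """Return the fraction of weights whose absolute value is below tol."""
    if not weights:
        raise ValueError("weights must be non-empty")
    return sum(1 for w in weights if abs(w) < tol) / len(weights)

# Hypothetical stand-in for a trained model's flattened weights:
# roughly 70% exact zeros, the rest drawn from a unit Gaussian.
rng = random.Random(0)
weights = [0.0 if rng.random() < 0.7 else rng.gauss(0.0, 1.0) for _ in range(10_000)]

frac = near_zero_fraction(weights)
print(f"{frac:.1%} of weights are within 1e-3 of zero")
```

On a real network you would flatten each layer's parameters into such a list and probably look at the whole histogram rather than a single threshold.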

There is something called the lottery ticket hypothesis (maybe mentioned in the paper, I’m on my phone), which, loosely put, says that large models are effectively massive ensembles of random models, and the top levels of the network pick the one or two that randomly happen to work.

Maybe true, but even then it's only part of the story: kernels in CNNs genuinely seem to learn features like edges and textures.

There are two answers to this. First, empirically we see that the more parameters we add, the better the model performs, which suggests the weights continue to contribute (and aren't dead).

Second, there is a very popular paper called "The lottery ticket hypothesis" [1], which argues that in any network you can find subnetworks that, trained on their own from the original initialization, work just as well, i.e. the parameters are redundant. This was written in 2018, which is a long time ago in the big NN world, so I'm not sure how it holds up for today's insanely large models.

[1] https://arxiv.org/abs/1803.03635
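To make the "find a subnetwork that works just as well" idea concrete, here is a toy prune-and-rewind sketch. It is not the paper's experimental setup: it assumes a plain linear regression where only two of eight features matter, does a single round of magnitude pruning, and all names (`train`, `magnitude_mask`, `mse`) are made up for illustration.

```python
import random

def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def mse(w, xs, ys):
    return sum((predict(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def train(w_init, mask, xs, ys, lr=0.05, steps=300):
    """Gradient descent on mean squared error; masked-out weights stay at zero."""
    w = [wi * m for wi, m in zip(w_init, mask)]
    n = len(xs)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            err = predict(w, x) - y
            for j, xj in enumerate(x):
                grad[j] += 2.0 * err * xj / n
        w = [(wi - lr * g) * m for wi, g, m in zip(w, grad, mask)]
    return w

def magnitude_mask(w, keep):
    """Keep the `keep` largest-magnitude weights and zero out the rest."""
    top = set(sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)[:keep])
    return [1.0 if j in top else 0.0 for j in range(len(w))]

rng = random.Random(0)
dim = 8
xs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(200)]
ys = [3.0 * x[0] - 2.0 * x[1] for x in xs]        # only two features matter
w_init = [rng.gauss(0.0, 0.1) for _ in range(dim)]

dense = train(w_init, [1.0] * dim, xs, ys)        # 1) train the full model
mask = magnitude_mask(dense, keep=2)              # 2) prune by magnitude
ticket = train(w_init, mask, xs, ys)              # 3) rewind to init, retrain sparse

print("dense loss :", mse(dense, xs, ys))
print("ticket loss:", mse(ticket, xs, ys))
```

In this toy the sparse subnetwork matches the dense one almost by construction, since the target only depends on two inputs. Whether that carries over to very large models is exactly the open question in this thread.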

A couple notes...

1) Imagine the loss surface of a given model architecture; each point on the surface corresponds to a full set of weights, and the value at the point is the model loss. So, a billion-dimensional surface, give or take. There's a massive amount of flexibility in that space. Some models on the surface are sparse, but they are adjacent to models which are just as good but not sparse at all. Likewise, if you 'rotate' a sparse model, you can end up with an entirely equivalent dense model. So, you really need additional 'pressure' on the learning problem to ensure you actually get sparsity, even if the sparsity is in some sense natural.
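The 'rotate' point can be shown in a few lines. This is a minimal sketch under a strong assumption: a two-layer purely linear model y = W2 W1 x with no nonlinearity in between, where inserting any rotation Q and its transpose between the layers leaves the function unchanged. With ReLUs between layers an arbitrary rotation would generally change the function, so treat this as intuition only. The matrices and helper names are made up.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def count_zeros(m, tol=1e-12):
    return sum(1 for row in m for v in row if abs(v) < tol)

# Sparse two-layer linear model: y = W2 @ W1 @ x
W1 = [[2.0, 0.0], [0.0, -1.0]]
W2 = [[1.0, 0.0], [0.0, 3.0]]

# Rotate the hidden layer by 30 degrees: W1 -> Q W1, W2 -> W2 Q^T.
theta = math.pi / 6
Q = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]
W1_rot = matmul(Q, W1)
W2_rot = matmul(W2, transpose(Q))

x = [[0.5], [-1.5]]
y_sparse = matmul(W2, matmul(W1, x))
y_dense = matmul(W2_rot, matmul(W1_rot, x))

print("zeros before:", count_zeros(W1) + count_zeros(W2))
print("zeros after :", count_zeros(W1_rot) + count_zeros(W2_rot))
print("outputs     :", y_sparse, y_dense)
```

Both weight sets sit at the same loss, which is why the optimizer has no reason to prefer the sparse one unless something like an L1 penalty pushes it there.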

2) IIUC, the lottery ticket hypothesis kinda breaks down with larger models/problems. For small enough problems, the initial random projection given by the random starting weights is already good enough to build on. For bigger + more complicated problems, the weights really need to adapt in early training, so a subnetwork picked out at initialization no longer works as well.