by talolard 1725 days ago
There are two answers to this. First, empirically, we see that the more parameters we add, the better the model performs, which suggests the weights continue to contribute (and aren't dead).

Second, there is a very popular paper called "The Lottery Ticket Hypothesis" [1], which argues that a dense network contains sparse subnetworks that can be trained to work about as well, i.e. many of the parameters are redundant. It was written in 2018, which is a long time ago in the big-NN world, so I'm not sure how well it holds up for today's insanely large models.

[1] https://arxiv.org/abs/1803.03635
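For anyone who wants the idea in concrete terms, here is a rough sketch of the "find a subnetwork" step as I understand it: keep the largest-magnitude trained weights and reset the survivors to their initial values. The helper names (`magnitude_mask`, `lottery_ticket_reset`) are made up for illustration, and the "trained" weights are just the initial weights plus noise, standing in for a real training run. This is not the paper's code.

```python
import numpy as np

def magnitude_mask(weights, sparsity):
    """0/1 mask keeping the largest-magnitude (1 - sparsity) fraction of weights."""
    k = int(round(sparsity * weights.size))  # number of weights to prune
    if k == 0:
        return np.ones_like(weights)
    # Magnitude of the k-th smallest weight; everything at or below it is pruned.
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    return (np.abs(weights) > threshold).astype(weights.dtype)

def lottery_ticket_reset(init_weights, trained_weights, sparsity):
    """Pick the mask from the trained weights, then rewind survivors to init."""
    mask = magnitude_mask(trained_weights, sparsity)
    return init_weights * mask, mask

rng = np.random.default_rng(0)
init = rng.normal(size=(4, 5))
# Assumption: a stand-in for real training, just init plus random noise.
trained = init + rng.normal(size=(4, 5))

ticket, mask = lottery_ticket_reset(init, trained, sparsity=0.8)
print(int(mask.sum()), "of", mask.size, "weights kept")
```

In the full procedure you would then retrain `ticket` with the mask held fixed and compare it against the dense model; that part is omitted here.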


A couple notes...

1) Imagine the loss surface of a given model architecture: each point on the surface corresponds to a full set of weights, and the value at that point is the model's loss. So, a billion-dimensional surface, give or take. There's a massive amount of flexibility in that space. Some points on the surface are sparse models, but they sit right next to models that are just as good and not sparse at all. Likewise, if you 'rotate' a sparse model, you can end up with an entirely equivalent dense model. So you really need additional 'pressure' on the learning problem to ensure you actually get sparsity, even if the sparsity is in some sense natural.
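To make the 'rotate' point concrete, here is a minimal sketch under a simplifying assumption: a two-layer network with no nonlinearity between the layers, so an orthogonal change of basis in the hidden layer cancels exactly. With ReLU or similar in between, an arbitrary rotation would not preserve the function. The matrices and sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sparse two-layer linear model: y = W2 @ (W1 @ x). Diagonal weights are
# an easy stand-in for "sparse" (8 nonzeros out of 64 entries each).
W1 = np.diag(np.arange(1.0, 9.0))
W2 = np.diag(np.arange(8.0, 0.0, -1.0))

# Random orthogonal matrix R from a QR decomposition, so R.T @ R = I.
R, _ = np.linalg.qr(rng.normal(size=(8, 8)))

# "Rotate" the hidden layer: W2 @ R.T @ R @ W1 == W2 @ W1.
W1_dense = R @ W1
W2_dense = W2 @ R.T

x = rng.normal(size=(8, 3))
same_output = np.allclose(W2 @ (W1 @ x), W2_dense @ (W1_dense @ x))

def density(w):
    """Fraction of entries that are nonzero."""
    return np.count_nonzero(w) / w.size

print("same function:", same_output)
print("density before:", density(W1), "after:", density(W1_dense))
```

Both weight sets compute the same function and so have the same loss, but nothing in plain gradient descent prefers the sparse one. That is why you need something like an L1 penalty or explicit pruning as the extra 'pressure'.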

2) IIUC, the lottery ticket result kinda breaks down with larger models/problems. For small enough problems, the random projection given by the random starting weights is already good enough to build on. For bigger and more complicated problems, the weights need to adapt a lot in early training, so a subnetwork reset to its initial weights no longer trains as well.