| HN Mirror

	by whiteandnerdy 499 days ago
	You're correct, and the term you're looking for is "regularisation". There are two common ways of doing this:

	* L1 or L2 regularisation: penalises models whose weight matrices are complex (in the sense of having lots of large elements)
	* Dropout: train on random subsets of the neurons to force the model to rely on simple representations that are distributed robustly across its weights
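To make the two options concrete, here is a minimal pure-Python sketch. The function names (`l1_penalty`, `l2_penalty`, `dropout`), the strength `lam` and the toy weights are all made up for illustration and are not taken from any particular library. The dropout shown is the common "inverted" variant, which rescales the surviving units so the expected activation is unchanged.

```python
import random

def l1_penalty(weights, lam):
    """L1 penalty: lam times the sum of absolute weight values."""
    return lam * sum(abs(w) for row in weights for w in row)

def l2_penalty(weights, lam):
    """L2 penalty: lam times the sum of squared weight values."""
    return lam * sum(w * w for row in weights for w in row)

def dropout(activations, keep_prob, rng):
    """Inverted dropout: zero each unit with probability 1 - keep_prob
    and divide the survivors by keep_prob."""
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

# Hypothetical 2x2 weight matrix; large entries dominate the L2 term.
weights = [[0.5, -2.0], [0.0, 1.5]]
print("L1 penalty:", l1_penalty(weights, 0.1))
print("L2 penalty:", l2_penalty(weights, 0.1))

# A fresh random mask would be drawn on every training step.
rng = random.Random(0)
print("after dropout:", dropout([1.0, 2.0, 3.0, 4.0], 0.5, rng))
```

In training, either penalty would be added to the data loss before taking gradients, while dropout is applied to the activations and switched off at evaluation time.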


levocardia 499 days ago

Dropout is roughly equivalent to layer-specific L2 regularization, and it's easy to see why: averaged over many random masks, dropping out neurons achieves something similar to penalising weights in proportion to their squared magnitude, which shrinks them towards zero.
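For the simplest case this can be checked exactly. The sketch below assumes a single linear unit with squared-error loss and inverted dropout on its inputs (an assumption for illustration; the comment above is about whole layers). In that setting the expected dropout loss equals the plain loss plus an L2-style penalty, where each squared weight is scaled by the squared input it multiplies. All names and numbers are hypothetical.

```python
from itertools import product

def expected_dropout_loss(x, w, y, keep_prob):
    """Exact E[(y - sum_j (m_j / keep_prob) * x_j * w_j)^2], computed by
    enumerating every keep/drop mask m with its probability."""
    total = 0.0
    for mask in product([0, 1], repeat=len(x)):
        prob = 1.0
        for m in mask:
            prob *= keep_prob if m else (1.0 - keep_prob)
        pred = sum(m / keep_prob * xj * wj for m, xj, wj in zip(mask, x, w))
        total += prob * (y - pred) ** 2
    return total

def loss_plus_penalty(x, w, y, keep_prob):
    """Plain squared error plus ((1 - p) / p) * sum_j (x_j * w_j)^2."""
    plain = (y - sum(xj * wj for xj, wj in zip(x, w))) ** 2
    penalty = (1.0 - keep_prob) / keep_prob * sum(
        (xj * wj) ** 2 for xj, wj in zip(x, w))
    return plain + penalty

# Toy input, weights and target (made up for illustration).
x = [1.0, -2.0, 0.5]
w = [0.3, 0.8, -1.2]
y = 1.0
print(expected_dropout_loss(x, w, y, 0.8))
print(loss_plus_penalty(x, w, y, 0.8))
```

The two printed values agree up to floating-point rounding. For deeper, nonlinear networks the correspondence is only approximate, which is why "roughly equivalent" is the right level of claim.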

Elements of Statistical Learning (Hastie, Tibshirani and Friedman) has a nice argument that, for linear models, L2 regularization acts as a soft form of dimensionality reduction: it shrinks the low-variance principal-component directions of the inputs the most. You could use that to motivate a "simplicity prior" idea in deep learning.
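A sketch of that argument in standard notation (the symbols are my choice, not from the thread): take a centered design matrix with singular value decomposition X = U D V^T, singular values d_1 >= ... >= d_p, and ridge penalty strength lambda. The fitted values of ridge regression, and of principal components regression keeping M components, can then be written as:

```latex
X = U D V^\top, \qquad
X \hat\beta^{\text{ridge}}
  = \sum_{j=1}^{p} u_j \, \frac{d_j^2}{d_j^2 + \lambda} \, u_j^\top y,
\qquad
X \hat\beta^{\text{pcr}}
  = \sum_{j=1}^{M} u_j \, u_j^\top y .
```

Principal components regression keeps the first M directions with factor 1 and drops the rest with factor 0. Ridge applies the smooth factor d_j^2 / (d_j^2 + lambda) instead, which is close to 1 for high-variance directions and close to 0 for low-variance ones, hence "semi-equivalent".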

Yet another way of thinking about it, in the context of ReLU units, is that a layer of ReLUs forms a truncated hyper-plane basis (like splines but in higher dimensions) in feature space, and regularization induces smoothness in this N-dimensional basis by shrinking that basis towards being a flat hyper-plane.
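A one-dimensional toy version may help with the picture. The sketch below assumes a function of the form f(x) = bias + sum_k a_k * relu(x - t_k), which is piecewise linear with a kink at each knot t_k, and it models regularization simply as scaling the coefficients a_k towards zero. That is a simplification of what weight decay does in a real network, and the names and numbers are hypothetical.

```python
def relu(z):
    return z if z > 0.0 else 0.0

def relu_layer(x, knots, coeffs, bias=0.0):
    """f(x) = bias + sum_k coeffs[k] * relu(x - knots[k])."""
    return bias + sum(a * relu(x - t) for a, t in zip(coeffs, knots))

def slope_changes(knots, coeffs, eps=1e-6):
    """Jump in slope across each knot, estimated by one-sided finite
    differences. Assumes knots are distinct and more than eps apart."""
    jumps = []
    for t in knots:
        here = relu_layer(t, knots, coeffs)
        left = (here - relu_layer(t - eps, knots, coeffs)) / eps
        right = (relu_layer(t + eps, knots, coeffs) - here) / eps
        jumps.append(right - left)
    return jumps

knots = [-1.0, 0.0, 1.0]
coeffs = [2.0, -3.0, 1.5]
shrunk = [0.1 * a for a in coeffs]  # stand-in for a strong penalty

print("kinks before shrinking:", slope_changes(knots, coeffs))
print("kinks after shrinking: ", slope_changes(knots, shrunk))
```

The slope jump at each knot is (up to rounding) the coefficient on that ReLU, so shrinking the coefficients makes every kink gentler. With all coefficients at zero the function is just the flat line f(x) = bias.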

whiteandnerdy 499 days ago

Wow! I think I dimly intuited your first paragraph already; I directionally get why your second might be true (although I'd have thought L1 was even more so, since it encourages zeros which is kind of like choosing a subspace).
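The L1 intuition can be seen in the textbook special case of an orthonormal design, where both penalties act on each least-squares coefficient separately. A minimal sketch under that assumption: L1 gives soft-thresholding, which sets small coefficients exactly to zero, while L2 only rescales them. The exact constants depend on how the penalty is scaled, and the names and values below are made up.

```python
def soft_threshold(b, lam):
    """L1 (lasso-style) update: move b towards zero by lam, stopping at 0."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0

def ridge_shrink(b, lam):
    """L2 (ridge-style) update: scale b by 1 / (1 + lam)."""
    return b / (1.0 + lam)

# Hypothetical least-squares coefficients, two large and two small.
coeffs = [3.0, -0.4, 0.05, -1.2]
lam = 0.5

lasso = [soft_threshold(b, lam) for b in coeffs]
ridge = [ridge_shrink(b, lam) for b in coeffs]
print("L1:", lasso)
print("L2:", ridge)
```

Under L1 the two small coefficients become exactly zero, which is the "choosing a subspace" behaviour. Under L2 every coefficient gets smaller but none is removed.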

Your third paragraph took me ages to get an intuition for - is the idea that regularisation penalises having "sharp elbows" at the join points of your hyper-spline thing? That's mind-blowing and such an interesting way to think about what a ReLU layer is doing.

Thanks so much for a thought-provoking comment, that's incredibly cool.