by SimplyUnknown
1044 days ago
First of all, great blog post with great examples. It reminds me of what distill.pub used to be. Second, the article correctly states that L2 weight decay is typically used, leading to a lot of weights with small magnitudes. For models that generalize better, would it then be better to always use L1 weight decay to promote sparsity, in combination with longer training? I wonder whether deep learning models that only use sparse Fourier features rather than dense linear layers would work better...
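To make the L1 versus L2 contrast concrete, here is a minimal sketch on a toy linear regression, not a deep network. Everything in it is an assumption for illustration: the synthetic data, the penalty strength `lam`, and the helper name `fit`. Note also that the L1 case uses a proximal (soft-thresholding, ISTA-style) step rather than a plain subgradient step, since that is what produces exact zeros; a typical deep learning optimizer with an L1 term added to the loss would not necessarily behave the same way.

```python
import numpy as np

# Synthetic problem (illustrative assumption): only 5 of 50 features matter.
rng = np.random.default_rng(0)
n_samples, n_features = 200, 50
X = rng.standard_normal((n_samples, n_features))
w_true = np.zeros(n_features)
w_true[:5] = [3.0, -2.0, 1.5, -1.0, 2.5]
y = X @ w_true + 0.1 * rng.standard_normal(n_samples)


def fit(penalty, lam=0.1, steps=2000):
    """Minimize (1/2n)||Xw - y||^2 plus an L1 or L2 penalty on w."""
    w = np.zeros(n_features)
    # Lipschitz constant of the least-squares gradient, used for a safe step size.
    lipschitz = np.linalg.norm(X, 2) ** 2 / n_samples
    lr = 1.0 / (lipschitz + lam)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n_samples
        if penalty == "l2":
            # L2 decay: shrinks every weight proportionally, rarely to exactly 0.
            w = w - lr * (grad + lam * w)
        else:
            # L1 via soft-thresholding: weights below the threshold become exactly 0.
            v = w - lr * grad
            w = np.sign(v) * np.maximum(np.abs(v) - lr * lam, 0.0)
    return w


w_l2 = fit("l2")
w_l1 = fit("l1")
print("nonzero weights with L2:", np.count_nonzero(w_l2))
print("nonzero weights with L1:", np.count_nonzero(w_l1))
```

On this toy problem the L2 fit keeps many small but nonzero weights, while the L1 fit zeroes most of the irrelevant ones. Whether that carries over to generalization in deep models is exactly the open question being asked.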
Longer answer: deep learning models are usually trying to find the best nonlinear basis in which to represent inputs. If the inputs are well represented (read: can be sparsely represented) in some basis known a priori, it usually helps to just put them in that basis, e.g., by FFT’ing RF signals.
The challenge is that the overall-optimal basis might not match the basis at any of the local minima, so you’ve got to do some tricks to nudge the network closer.
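As a rough illustration of the "put it in a known basis" point, the sketch below compares how many coefficients are needed to hold 99% of a signal's energy in the time domain versus after an FFT. The two-tone test signal, the 99% threshold, and the helper name `sparsity_fraction` are assumptions made up for this example; real RF data would be noisier and less cleanly sparse.

```python
import numpy as np


def sparsity_fraction(coeffs, energy=0.99):
    """Fraction of coefficients needed to capture `energy` of the total energy."""
    power = np.sort(np.abs(coeffs) ** 2)[::-1]  # largest contributions first
    cumulative = np.cumsum(power) / power.sum()
    k = int(np.searchsorted(cumulative, energy)) + 1
    return k / coeffs.size


# Toy signal (illustrative assumption): two pure tones sampled over one period.
n = 1024
t = np.arange(n) / n
signal = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

time_frac = sparsity_fraction(signal)               # dense in the time domain
freq_frac = sparsity_fraction(np.fft.rfft(signal))  # sparse in the Fourier domain

print(f"time domain: {time_frac:.1%} of coefficients needed")
print(f"frequency domain: {freq_frac:.1%} of coefficients needed")
```

A network fed the Fourier coefficients here has only a couple of large inputs to attend to, instead of having to discover the periodic structure from raw samples.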