| ML researcher here, wanting to offer a clarification. L1 regularization induces sparsity. Weight decay explicitly _does not_, as it is an L2 penalty. This is a common misconception. Something a lot of people don't know is that weight decay works because, when applied as regularization, it pushes the network toward the minimum description length (MDL), which reduces regret during training. Pruning in the brain is somewhat related, but because the brain uses sparsity to (fundamentally, IIRC) induce representations rather than to compress, it's basically a different motif entirely. If you need a hint on this one, think about the implicit biases of different representations and the downstream impacts they can have on the learned (or learnable) representations of whatever system is in question. I hope this answers your question. |
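
To make the L1 vs. L2 point concrete, here is a minimal toy sketch, not a real training setup. It assumes a separable quadratic loss `0.5 * (w - target)^2` per weight, plain gradient descent, and treats weight decay as a plain L2 penalty (as the comment does). The names `train`, `soft_threshold`, `targets`, and `lam` are made up for illustration. The L1 case uses a proximal (soft-thresholding) step, which is one standard way to handle the non-differentiable `|w|` term.

```python
def soft_threshold(x, t):
    """Proximal operator of t * |x|: shrink toward zero, clamp to exactly 0.0."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def train(targets, lam, penalty, lr=0.1, steps=500):
    """Minimize sum_i 0.5 * (w_i - targets_i)^2 plus an L1 or L2 penalty."""
    w = [1.0] * len(targets)  # start away from zero on purpose
    for _ in range(steps):
        for i, t in enumerate(targets):
            grad = w[i] - t  # gradient of the data term only
            if penalty == "l2":
                # L2 / weight decay: the penalty adds lam * w to the gradient,
                # shrinking the weight proportionally to its size.
                w[i] -= lr * (grad + lam * w[i])
            elif penalty == "l1":
                # L1: gradient step on the data term, then soft-threshold.
                w[i] = soft_threshold(w[i] - lr * grad, lr * lam)
            else:
                raise ValueError(f"unknown penalty: {penalty}")
    return w


targets = [2.0, 0.1, -0.3, -1.5]  # two "small" weights, two "large" ones
lam = 0.5

l1_weights = train(targets, lam, "l1")
l2_weights = train(targets, lam, "l2")

print("L1:", [round(w, 4) for w in l1_weights])
print("L2:", [round(w, 4) for w in l2_weights])
print("exact zeros with L1:", sum(w == 0.0 for w in l1_weights))
print("exact zeros with L2:", sum(w == 0.0 for w in l2_weights))
```

In this toy problem, the L1 run sets the weights whose targets are smaller in magnitude than `lam` to exactly zero, while the L2 run only rescales every weight by `1 / (1 + lam)` and leaves all of them nonzero. That is the sense in which L2 shrinks but does not sparsify.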