by tbalsam 1084 days ago

ML researcher here wanting to offer a clarification.

L1 induces sparsity. Weight decay explicitly _does not_, as it is L2. This is a common misconception.
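To make the difference concrete, here is a minimal sketch in plain Python. It compares one proximal step for an L1 penalty (soft-thresholding) with one for an L2 penalty (multiplicative shrinkage). The function names, the toy weights, and the value of `lam` are illustrative assumptions, not taken from any particular library or training setup.

```python
def prox_l1(w, lam):
    """Proximal step for the penalty lam * |w| (soft-thresholding)."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0  # anything with |w| <= lam is set exactly to zero


def prox_l2(w, lam):
    """Proximal step for the penalty (lam / 2) * w**2 (pure rescaling)."""
    return w / (1.0 + lam)


# Hypothetical toy weights and penalty strength, chosen for illustration only.
weights = [0.05, -0.3, 1.2, -0.01, 0.6]
lam = 0.1

l1_out = [prox_l1(w, lam) for w in weights]
l2_out = [prox_l2(w, lam) for w in weights]

# Count how many weights each penalty drives to exactly zero.
zeros_l1 = sum(1 for w in l1_out if w == 0.0)
zeros_l2 = sum(1 for w in l2_out if w == 0.0)

print(zeros_l1, zeros_l2)  # prints "2 0"
```

The L1 step zeroes the two small weights outright, while the L2 step only rescales every weight toward zero without ever reaching it.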

Something a lot of people don't know is that weight decay works because when applied as regularization it causes the network to approach the MDL, which reduces regret during training.
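One standard way to make the MDL link concrete, offered as a sketch rather than as the specific argument meant above: in a two-part code, the total description length is the cost of encoding the data given the weights plus the cost of encoding the weights themselves. Assuming a zero-mean Gaussian prior with variance $\sigma^2$ over the weights $w$ (the prior and the symbols $D$, $\sigma$, and $\lambda$ are assumptions introduced here for illustration), the weight cost becomes a squared-norm penalty:

```latex
L(D, w) = \underbrace{-\log p(D \mid w)}_{\text{data cost}}
        + \underbrace{-\log p(w)}_{\text{weight cost}},
\qquad
-\log p(w) = \frac{1}{2\sigma^2}\,\lVert w \rVert_2^2 + \text{const}
```

Under that assumption, minimizing the training loss plus $\lambda \lVert w \rVert_2^2$ with $\lambda = 1/(2\sigma^2)$ is the same as minimizing this two-part code length up to constants. This covers only the description-length side and does not by itself establish anything about regret.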

Pruning in the brain is somewhat related, but because the brain uses sparsity to (fundamentally, IIRC) induce representations instead of compression, it's basically a different motif entirely.

If you need a hint here on this one, think about the implicit biases of different representations and the downstream impacts that they can have on the learned (or learnable) representations of whatever system is in question.

I hope this answers your question.

3 comments

mmmmpancakes 1084 days ago

can you please spell out what MDL is an acronym for?

sva_ 1084 days ago

https://en.wikipedia.org/wiki/Minimum_description_length

mmmmpancakes 1084 days ago

thanks

naasking 1084 days ago

> because the brain uses sparsity to (fundamentally, IIRC) induce representations instead of compression

What's the evidence for this?

heyitsguay 1084 days ago

https://bernstein-network.de/wp-content/uploads/2021/03/Lect... this has an awesome overview of the current understanding of neural encoding mechanisms.

tbalsam 1084 days ago

I enjoyed this presentation, thank you for sharing it. Good stuff in here.

I think the reasoning behind the basis functions is a bit off, but as I noted elsewhere here, that's work I can't fully talk about yet because I'm actively developing it. I'll release it when I can.

However, you can see some of the empirical consequences of my updated understanding of encoding and compression in an upcoming release of hlb-CIFAR10, which should cut out another decent chunk of training time. As part of it, we reduce the network from a ResNet8 architecture to a ResNet7, and we additionally remove one of the (potentially less necessary) residuals. It is all 'just' empirical, of course, but long-term, as they say, the proof is in the pudding, since things are already so incredibly tightened down.

joaogui1 1084 days ago

That looks interesting, do you know what paper talks about the connection between MDL, regret, and weight decay?

tbalsam 1084 days ago

I would start with Shannon's information theory and the Wikipedia pages on L2 regularization and the MDL.

For the first, there are a few good papers that simplify the concepts even further.

joaogui1 1079 days ago

Sorry, I know what MDL and L2 regularization are. I would like the paper that connects them in the way you mentioned.
