Hacker News
by thehappyfellow 897 days ago
This book is not aimed at practitioners, but I don’t think that means it deserves to be called “actively the worst one”.

Even though the frontier of deep learning is very much empirical, there’s interesting work trying to understand why the techniques work, not only which ones do.

I’m sorry, but saying proofs are not a good method for gaining understanding is ridiculous. Of course it’s not great for everyone, but a book titled “Mathematical Introduction to x” is obviously for people with some mathematical training. For that kind of audience, lemmas and their proofs are a natural way of building understanding.


Just read the section on ResNets (Section 1.5) and tell me if you think that's the best way to explain ResNets to literally anyone. Tell me if, from that description, you take away that the reason skip connections improve performance is that they improve gradient flow in very deep networks.
> the reason skip connections improve performance is that they improve gradient flow in very deep networks.

Can you prove this statement?

Neither do the authors in the book, and I'd argue that after (only) reading the book, the reader wouldn't be equipped to attempt this either (see my other post in this thread), so I think the parent poster has a point.
Yes, I have a very good point in fact. But the above comment purposely chooses not to argue with it, because it's easier to ignore it entirely and argue something else.
The problem is that you presented something as a fact when it’s just a guess. Some people guess it’s improved gradient flow, others guess it’s a smoother loss surface, someone else guesses it’s a shortcut for early-layer information to reach later layers, etc. We don’t actually know why resnets work so well.
The point of that comment doesn't have anything to do with how ResNets actually work. You missed the actual point.

> We don’t actually know why resnets work so well.

Yes, actually, we do. We know from the literature that very deep neural networks suffered from vanishing gradients in their early layers in the same way traditional RNNs did. We know that was the motivation for introducing skip connections, which gives us a hypothesis we can test. We can measure, using the test I described, the differences in the size of gradients in the early layers with and without skip connections. We can do this across many different problems for additional statistical power. We can analyze the linear case and see that the repeated matmuls should lead to small gradients if their singular values are small. To ignore all of this and say "well, we don't have a general proof that satisfies a mathematician, so I guess we just don't know" is silly.
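
For anyone who wants to poke at the linear case, here is a minimal sketch (not from the thread or the book). It assumes each layer's weight matrix is `scale` times a random 2x2 rotation, so every singular value equals `scale`. The helper names and the values `depth=50`, `scale=0.1` are illustrative choices, not anything the comment above specified.

```python
import math
import random

def rotation(theta):
    """2x2 orthogonal matrix; every singular value is exactly 1."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def matvec_T(W, g):
    """Return W^T g for a 2x2 matrix W."""
    return [W[0][0] * g[0] + W[1][0] * g[1],
            W[0][1] * g[0] + W[1][1] * g[1]]

def backprop_norm(depth, scale, skip, seed=0):
    """Norm of the gradient that reaches the input of a deep linear net.

    Each layer computes h -> scale * Q h (plain) or h + scale * Q h (skip),
    so backprop multiplies by scale * Q^T or by (I + scale * Q)^T.
    """
    rng = random.Random(seed)
    g = [1.0, 0.0]  # unit gradient arriving at the output
    for _ in range(depth):
        Q = rotation(rng.uniform(0.0, 2.0 * math.pi))
        Wg = [scale * v for v in matvec_T(Q, g)]
        g = [g[i] + Wg[i] for i in range(2)] if skip else Wg
    return math.hypot(g[0], g[1])

plain = backprop_norm(depth=50, scale=0.1, skip=False)
residual = backprop_norm(depth=50, scale=0.1, skip=True)
print(f"plain: {plain:.3e}  residual: {residual:.3e}")
```

In this toy setup the plain gradient norm is `scale ** depth` (about 1e-50 here), while the skip version cannot fall below `(1 - scale) ** depth`, because every factor `(I + scale * Q)^T` has singular values in `[1 - scale, 1 + scale]`. It only illustrates the linear argument under these assumptions; it says nothing about nonlinear networks.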

Empirically yes, I can consider a very deep fully-connected network, measure the gradients in each layer with and without skip connections, and compare. I can do this across multiple seeds and run a statistical test on the deltas.
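
A rough sketch of what that experiment could look like, in pure Python so it runs anywhere. Everything here is an assumption for illustration: a small tanh network (`depth=30`, `width=8`), a deliberately small `init_std=0.5` so the plain network shows the effect, a loss equal to the sum of the final activations (so the gradient arriving at the output is all ones in both networks), and an exact one-sided sign test standing in for whichever statistical test one prefers.

```python
import math
import random

def first_layer_grad_norm(depth, width, skip, seed, init_std=0.5):
    """Gradient norm of the loss w.r.t. the first layer's weights."""
    rng = random.Random(seed)  # same seed -> same weights for both variants
    scale = init_std / math.sqrt(width)
    Ws = [[[rng.gauss(0.0, scale) for _ in range(width)] for _ in range(width)]
          for _ in range(depth)]
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]

    # Forward pass, keeping each layer's input and tanh output.
    inputs, acts = [], []
    h = x
    for W in Ws:
        a = [math.tanh(sum(w * v for w, v in zip(row, h))) for row in W]
        inputs.append(h)
        acts.append(a)
        h = [hi + ai for hi, ai in zip(h, a)] if skip else a

    # Backward pass for loss = sum(h_final), so dL/dh_final is all ones.
    g = [1.0] * width
    for l in range(depth - 1, -1, -1):
        d = [gi * (1.0 - ai * ai) for gi, ai in zip(g, acts[l])]  # through tanh
        if l == 0:
            # dL/dW_0[i][j] = d[i] * x[j]
            return math.sqrt(sum((di * v) ** 2 for di in d for v in inputs[0]))
        back = [sum(Ws[l][i][j] * d[i] for i in range(width))
                for j in range(width)]  # W^T d
        g = [gi + bi for gi, bi in zip(g, back)] if skip else back

def sign_test_p(deltas):
    """One-sided exact sign test: P(at least k positives out of n | p = 0.5)."""
    n = len(deltas)
    k = sum(d > 0 for d in deltas)
    return sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n

deltas = []
for seed in range(10):
    plain = first_layer_grad_norm(30, 8, skip=False, seed=seed)
    resid = first_layer_grad_norm(30, 8, skip=True, seed=seed)
    deltas.append(math.log10(resid) - math.log10(plain))

print("log10(skip / plain) per seed:", [round(d, 2) for d in deltas])
print("sign test p-value:", sign_test_p(deltas))
```

Each delta is the log10 ratio of first-layer gradient norms for the same weights and input, with and without skip connections, so the comparison is paired by seed. A real study would use realistic initialization, a real task and loss, and more seeds; this only shows the shape of the measurement.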
Empirical studies are only useful until the system is mathematically understood. For example, I can construct transformer circuits where the skip connection (provably) purely adds noise.

I can also prove in particular cases the MLP's sole purpose is to remove the noise added from the skip connection.