You're doing it again - presenting guesses as facts. Why would a ResNet, a batch-normalized network using ReLU activations, suffer from the vanishing gradient problem? Does it? Have you actually done the experiment you've described? I have, and I didn't see gradients vanish. Sometimes gradients exploded, likely from a bad weight initialization (to be clear, that's a guess), and sometimes they didn't, but even when they didn't, the networks never converged. The best we can do is to say: "skip connections seem to help training deep networks, and we have a few guesses as to why, none of which is very convincing".
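For anyone who wants to look at gradients instead of arguing about them, here is a minimal sketch of the measurement in plain Python. It is not the experiment described above: there is no batch normalization, no convolution and no training loop, just a stack of square ReLU layers with He initialization and a made-up loss (the sum of the outputs). The names `make_layers` and `grad_norms` are hypothetical. All it shows is how to record the gradient norm at every layer, with and without skip connections.

```python
import math
import random


def make_layers(depth, width, rng):
    """Return `depth` square weight matrices with He initialization (assumed here)."""
    std = math.sqrt(2.0 / width)  # He init: std = sqrt(2 / fan_in)
    return [[[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]
            for _ in range(depth)]


def matvec(W, x):
    """Compute W @ x for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def mat_t_vec(W, g):
    """Compute W^T @ g for a list-of-rows matrix."""
    return [sum(W[i][j] * g[i] for i in range(len(W))) for j in range(len(W[0]))]


def grad_norms(layers, x, skip):
    """Return ||dL/dh_l|| for the input h_l of every layer, with L = sum(output).

    Forward rule per layer: h_next = relu(W h), plus h if `skip` is True.
    """
    masks = []
    h = x
    for W in layers:
        z = matvec(W, h)
        mask = [1.0 if v > 0 else 0.0 for v in z]  # ReLU derivative
        a = [v * m for v, m in zip(z, mask)]
        h = [ai + hi for ai, hi in zip(a, h)] if skip else a
        masks.append(mask)

    g = [1.0] * len(h)  # dL/d(output) for L = sum(output)
    norms = []
    for W, mask in zip(reversed(layers), reversed(masks)):
        back = mat_t_vec(W, [gi * m for gi, m in zip(g, mask)])
        # The identity branch passes the incoming gradient through unchanged.
        g = [b + gi for b, gi in zip(back, g)] if skip else back
        norms.append(math.sqrt(sum(v * v for v in g)))
    norms.reverse()  # norms[0] is the gradient norm at the network input
    return norms


if __name__ == "__main__":
    rng = random.Random(0)
    layers = make_layers(depth=50, width=16, rng=rng)
    x = [rng.gauss(0.0, 1.0) for _ in range(16)]
    for skip in (False, True):
        norms = grad_norms(layers, x, skip)
        print(f"skip={skip}: input-side norm={norms[0]:.3e}, output-side norm={norms[-1]:.3e}")
```

Whatever this prints for a given seed, depth and init says nothing about whether a real batch-normalized ResNet converges. It is only a way to check the "gradients vanish" claim for one specific setup before stating it as fact.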

> We know, from the literature

Let's look at the literature:

1. Training Very Deep Neural Networks: Rethinking the Role of Skip Connections: https://orbilu.uni.lu/bitstream/10993/47494/1/OyedotunAl%20I... The authors hypothesize that skip connections might help prevent the activations from turning into singular matrices, which in turn could lead to unstable gradients (or not, it's a guess).

2. Improving the Trainability of Deep Neural Networks through Layerwise Batch-Entropy Regularization: https://openreview.net/pdf?id=LJohl5DnZf The authors hypothesize that there is an optimal information flow through the network, and that a particular form of regularization helps improve this flow (no skip connections are needed).

3. Deep Learning without Shortcuts: Shaping the Kernel with Tailored Rectifiers: https://arxiv.org/abs/2203.08120 The authors focus on initial conditions and propose better activation functions.

Clearly the issues are a bit more complicated than the vanishing gradient problem, and each of these papers offers a different explanation of why skip connections help.

It's similar to people building a bridge in the 15th century - there was empirical evidence and intuition about how bridges should be built, but very little theory explaining that evidence or intuition. Your statements are like "next time we should make the support columns thicker so that the bridge doesn't collapse", when in reality it collapsed due to the resonant oscillations induced by people marching on it in unison. Thicker columns will probably help, but they do nothing to improve understanding of the issue. They are just a guess.

That's why we need mathematicians looking at it, and attempting to formalize at least parts of the empirical evidence, so that someone, some day, will develop a compelling theory.