If you're interested in more experiments, I think it would be worth looking at the learned batch norm scale and bias parameters after training a network, and at how the variances compare between networks trained with and without batch norm.
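To be concrete about which parameters I mean, here's a minimal pure-Python sketch of batch norm on a single feature. The names `batch_norm`, `gamma`, `beta` and the toy batch are my own, for illustration only. It just shows that, whatever the input statistics are, the output mean and variance end up set by the learned bias and scale, which is why those learned values seem worth inspecting after training.

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a batch, then apply the learned scale and bias."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)  # biased batch variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in xs]

def mean_and_var(xs):
    """Mean and (biased) variance of a list of floats."""
    mean = sum(xs) / len(xs)
    return mean, sum((x - mean) ** 2 for x in xs) / len(xs)

# Hypothetical batch of pre-activations with arbitrary mean and spread.
batch = [3.0, 7.0, 8.0, 12.0, 20.0]
out = batch_norm(batch, gamma=2.0, beta=0.5)
out_mean, out_var = mean_and_var(out)
print(out_mean, out_var)  # approximately beta and gamma ** 2
```

In a real framework you'd read the trained values off each batch norm layer instead of setting them by hand.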

One thing that's been somewhat confusing me is whether Goodfellow's explanation aligns with the "internal covariate shift" explanation.

Goodfellow's interpretation, that batch norm reduces second-order interactions among the variables, doesn't seem to be equivalent to the "internal covariate shift" explanation. More basically, I have trouble understanding why batch norm works well with ReLU. I've heard people say that you want the inputs to the ReLU to be mean-centered without too much variance, otherwise ReLU loses its nonlinearity. However, that would seem to imply that you want a bias of 0, and in that case, why even have that parameter?
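A rough way to poke at the ReLU point, as a toy sketch rather than anything definitive: assume the normalized inputs are roughly standard Gaussian (my assumption, not something a real network guarantees), then apply a bias and count how often the ReLU is active. The function name `relu_active_fraction` and the chosen bias values are hypothetical.

```python
import random

def relu_active_fraction(beta, gamma=1.0, n=100_000, seed=0):
    """Fraction of samples where gamma * z + beta > 0, with z ~ N(0, 1)."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    active = sum(1 for _ in range(n) if gamma * rng.gauss(0.0, 1.0) + beta > 0)
    return active / n

# Large negative bias: ReLU is almost always off.
# Zero bias: ReLU is on about half the time.
# Large positive bias: ReLU is almost always on, so it acts nearly linearly.
for beta in (-3.0, 0.0, 3.0):
    print(beta, relu_active_fraction(beta))
```

Under that assumption, a bias of 0 is the point where the unit is on about half the time, and moving the bias changes how often it fires. One possible reading is that the learned bias lets each unit pick that trade-off, though this toy setup doesn't show that trained networks actually use it that way.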

Also, given this "internal covariate shift" explanation, why would batch norm after the activation generally work better?