by rvarma 3043 days ago
Thanks for your comment!

Regarding point 1, I stored the activations both before and after batch norm, since I needed them during the backward pass. That is, I stored h_out before and after these operations:

  h_out = (h_out - np.mean(h_out, axis = 0)) / np.std(h_out, axis = 0)
  h_out = gamma * h_out + beta
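To make the "store both" part concrete, here is a minimal sketch of a forward pass that keeps the pre-normalization and normalized activations around for the backward pass. The function name batchnorm_forward, the cache dict, and the eps term are my own assumptions for illustration (the two lines above divide by np.std directly, with no epsilon), so treat this as one plausible layout rather than the code from the post.

```python
import numpy as np

def batchnorm_forward(h, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch axis, then scale and shift.

    eps is an assumption added for numerical stability; the snippet in
    the comment divides by np.std without it.
    """
    mu = np.mean(h, axis=0)
    var = np.var(h, axis=0)
    h_norm = (h - mu) / np.sqrt(var + eps)   # activations after normalization
    out = gamma * h_norm + beta              # activations after scale and shift
    # Keep what a backward pass would typically need (hypothetical layout).
    cache = {"h_in": h, "h_norm": h_norm, "mu": mu, "var": var,
             "gamma": gamma, "eps": eps}
    return out, cache

# Synthetic batch: 64 examples, 4 features, deliberately not zero-mean.
rng = np.random.default_rng(0)
h = rng.normal(loc=3.0, scale=2.0, size=(64, 4))
gamma = np.ones(4)
beta = np.zeros(4)
out, cache = batchnorm_forward(h, gamma, beta)
```

With gamma = 1 and beta = 0 the output should come out roughly zero-mean and unit-variance per feature, and cache holds both the input and the normalized activations.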

Regarding point 2, I do realize now that fixing a bad init via Xavier/He initialization and using batch norm fix slightly different problems. If I were to rewrite this post, I probably wouldn't talk about initialization at all, or I would at least mention Xavier/He initialization.

If you're interested in more experiments, I think it would be interesting to take a look at the learned batch norm scale/shift parameters (gamma and beta, i.e. the variance/bias parameters) after training a network, as well as the activation variances of networks trained with and without batch norm.

One thing that's been somewhat confusing me is whether Goodfellow's explanation aligns with the "internal covariate shift" explanation.

Goodfellow's interpretation that batch norm reduces second-order interactions among the variables doesn't seem to be equivalent to "internal covariate shift". In the first place, I have trouble understanding why batch norm works well with ReLU. I've heard people say that you want things to be mean-centered without too much variance, otherwise ReLU loses its nonlinearity. However, that would seem to imply that you want a bias of 0, and in that case, why even have that parameter?
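One way to poke at the bias question is to check how the shift changes the fraction of units that ReLU lets through. This is only a sketch on synthetic standard-normal inputs standing in for normalized activations; active_fraction and the beta values tried are hypothetical choices of mine, and it says nothing about what a trained network actually learns.

```python
import numpy as np

# Stand-in for batch-normalized pre-activations: zero mean, unit variance.
rng = np.random.default_rng(0)
h_norm = rng.standard_normal(100_000)

def active_fraction(beta, gamma=1.0):
    """Fraction of units with a positive pre-activation, i.e. passed by ReLU."""
    pre = gamma * h_norm + beta
    return float(np.mean(pre > 0))

# Try a negative, zero, and positive shift.
fractions = {b: active_fraction(b) for b in (-1.0, 0.0, 1.0)}
for b, frac in fractions.items():
    print(f"beta={b:+.1f}  active fraction={frac:.3f}")
```

With beta = 0 about half the units are active. A nonzero beta moves that fraction up or down, so the parameter at least gives each unit control over how sparse its ReLU output is.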

Also, given this "internal covariate shift" explanation, why would batch norm after the activation generally work better?