| A couple of things:
1. How can activations be below 0 in a ReLU network? It seems like:
> h_out = z * (z > 0)
is the line that you're using to store activations. How can that be below 0 for any value of z?
EDIT: Whoops, I missed that he was storing the output of activations + batch norm. I think it would make more sense to store just the output of the activations. The goal here is to show that batch norm provides good properties throughout the entire network, but in this case you're storing h_out/std(h_out), which trivially normalizes your layers to have the same variance.
2. Using initialization distributions to motivate batch norm is a little misleading. Prior to batch norm, people had already realized that initializing uniformly was a problem and had switched to Xavier initialization (specifically, a follow-up by Kaiming He found the initialization that works best for ReLU). I do think the intuition for why both make sense is fairly similar: Xavier initialization fixes the variances initially, while batch norm lets you maintain them through training. Another cool thing about your post: my intuition was that Xavier initialization is not necessary if one is also using batch norm, and it's cool to see that vindicated. |
Regarding point 1, I stored the activations both before and after batch norm, since I needed them during the backward pass. That is, I stored h_out before and after these operations:
h_out = (h_out - np.mean(h_out, axis=0)) / np.std(h_out, axis=0)
h_out = gamma * h_out + beta
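To make the distinction from point 1 concrete, here is a minimal sketch, not the code from the post. It assumes a single fully connected layer; the shapes, the names x, W, relu_out and bn_out, the values gamma = 1 and beta = 0, and the small eps term are all assumptions for illustration. It shows that the raw ReLU output is never negative, while the normalized output does go negative and has unit standard deviation per feature by construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 64 examples, 100 inputs, 50 hidden units.
x = rng.standard_normal((64, 100))
W = rng.standard_normal((100, 50)) * 0.01

z = x @ W
relu_out = z * (z > 0)  # ReLU output: never below 0

# Batch norm over the batch axis, mirroring the two lines above.
gamma, beta = 1.0, 0.0  # assumed values for illustration
eps = 1e-8              # assumption: guards against division by zero
normed = (relu_out - np.mean(relu_out, axis=0)) / (np.std(relu_out, axis=0) + eps)
bn_out = gamma * normed + beta

print("min of ReLU output:      ", relu_out.min())
print("min of batch norm output:", bn_out.min())
print("per-feature std after batch norm:", np.std(bn_out, axis=0)[:5])
```

Storing relu_out shows what the activations themselves look like; storing bn_out will always show roughly unit variance, whatever the layer is doing.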
Regarding point 2, I realize now that fixing a bad init via Xavier/He initialization and using batch norm address slightly different problems. If I were to rewrite this post, I probably wouldn't talk about initialization at all, or I would at least mention Xavier/He initialization.
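For reference, a rough sketch of the initialization side of this, assuming a plain stack of ReLU layers with no batch norm. The depth, width, and the helper name final_std are hypothetical choices for illustration. He initialization draws weights with standard deviation sqrt(2 / fan_in), which should keep the activation scale roughly constant at initialization, whereas a fixed small scale makes the activations shrink layer by layer.

```python
import numpy as np

def final_std(weight_std_fn, depth=10, width=512, batch=256, seed=0):
    """Push random inputs through `depth` ReLU layers; return the std of the last activations."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((batch, width))
    for _ in range(depth):
        # weight_std_fn maps fan_in to the standard deviation of the weights
        W = rng.standard_normal((width, width)) * weight_std_fn(width)
        z = h @ W
        h = z * (z > 0)  # ReLU
    return float(np.std(h))

small = final_std(lambda fan_in: 0.01)                # fixed small scale
he = final_std(lambda fan_in: np.sqrt(2.0 / fan_in))  # He initialization

print("fixed 0.01 scale:", small)
print("He init:         ", he)
```

This only checks the scale at initialization. It says nothing about what happens to the variances once training starts, which is the part batch norm is meant to handle.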