A couple of things:
1. How can activations be below 0 in a ReLU network? It seems like:
> h_out = z * (z > 0)
is the line that you're using to store activations. How can that be below 0 for any values of z?
EDIT: Whups, missed that he was storing the output of activations + batchnorm. I think it would make more sense to just store the output of the activations. The goal here is to show that batch norm provides good properties throughout the entire network. In this case, you're storing h_out/std(h_out), which trivially normalizes your layers to have the same variance.
2. Using initialization distributions to motivate batch norm is a little bit misleading. Prior to batch norm, people had realized that initializing uniformly was a problem, and had switched to using Xavier initialization (specifically, a follow-up by Kaiming He found the initialization that works best for ReLU).
I do think that the intuitions for why both make sense are fairly similar. Xavier initialization fixes the variances at initialization, whereas batch norm lets you maintain them throughout training.
Another thing that's cool about your post: my intuition was that Xavier initialization is not necessary if one is also using batch norm. It's cool to see that vindicated.
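For anyone who wants to poke at the init vs. batch norm point, here is a rough NumPy sketch. It is not the code from the post: the function name `forward_stds`, the layer width, depth, batch size, and the "small" scale of 0.01 are all made up for illustration, and the normalization step is plain per-feature standardization with no learned scale/shift.

```python
import numpy as np

def forward_stds(init, use_bn, n_layers=10, width=512, batch=1024, seed=0):
    """Std of each layer's output for a ReLU MLP fed random data (illustrative only)."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((batch, width))
    stds = []
    for _ in range(n_layers):
        if init == "he":
            scale = np.sqrt(2.0 / width)   # He scaling for ReLU
        elif init == "xavier":
            scale = np.sqrt(1.0 / width)   # Xavier scaling when fan_in == fan_out
        else:
            scale = 0.01                   # hypothetical, deliberately poor fixed scale
        W = rng.standard_normal((width, width)) * scale
        z = h @ W
        if use_bn:
            # Simplified batch norm: standardize each feature over the batch,
            # without the learned scale and shift.
            z = (z - z.mean(axis=0)) / (z.std(axis=0) + 1e-5)
        h = z * (z > 0)                    # ReLU, written as in the post
        stds.append(float(h.std()))
    return stds

for init in ("small", "xavier", "he"):
    for use_bn in (False, True):
        print(init, use_bn, [f"{s:.2e}" for s in forward_stds(init, use_bn)])
```

Without the normalization step, the poorly scaled init drives the activations toward zero within a few layers. With it, the per-layer std comes out essentially the same for every init scale, because standardizing z cancels any constant factor on W. This only looks at the forward pass at initialization and says nothing about what happens during training.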
Regarding point 1, I stored both the activations before batchnorm and after since I needed them during the backwards pass. i.e. I stored h_out before and after these operations:
Regarding point 2, I do realize now that fixing a bad init via Xavier/He initialization and using batch norm fix slightly different problems. If I were to rewrite this post, I probably wouldn't talk about initialization at all, or I would at least mention Xavier/He initialization.
If you're interested in more experiments, I think it would be interesting to take a look at the batch norm variance/bias parameters after training a network, as well as the variances of networks trained with/without batch norm.
One thing that's been somewhat confusing me is whether Goodfellow's explanation aligns with the "internal covariate shift" explanation.
Goodfellow's interpretation that batch norm reduces second-order interactions among the variables doesn't seem to be equivalent to "internal covariate shift". In the first place, I have trouble understanding why batch norm works well on ReLU. I've heard people say that you want things to be mean-centered without too much variance, otherwise ReLU loses its nonlinearity. However, that would seem to imply that you want a bias of 0, and in that case, why even have that parameter?
Also, given this "internal covariate shift" explanation, why would batch norm after the activation generally work better?
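On the bias question, one thing that is easy to check numerically is what the shift does to a ReLU that follows the normalization. This is only a sketch under assumptions: `active_fraction` is a made-up helper, the normalized pre-activation is replaced by standard normal samples, and `gamma`/`beta` stand for the batch norm scale and shift.

```python
import numpy as np

def active_fraction(beta, gamma=1.0, n=100_000, seed=0):
    """Fraction of units a ReLU lets through after y = gamma * x_hat + beta."""
    rng = np.random.default_rng(seed)
    x_hat = rng.standard_normal(n)   # stand-in for a normalized pre-activation
    y = gamma * x_hat + beta         # batch norm scale and shift
    return float((y > 0).mean())

for beta in (-1.0, 0.0, 1.0):
    print(beta, active_fraction(beta))
```

With a shift of 0, roughly half of the units are active. A positive or negative shift moves that fraction up or down, so the parameter at least lets each unit pick its own sparsity. That doesn't settle whether a bias of 0 is what you want, it just shows the parameter isn't redundant in front of a ReLU.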
The first plot of "Activation 0" appears to actually be the random input, if it corresponds to hs[0] in the code. The rest of the activation plots seem to all be strictly non-negative. The other plots with negative values are of gradients, not activations.
Ah yeah, my bad. I should have shown the activations after the first layer instead, since "activation 0" is just the distribution of the random data I started with.
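To make the difference between the stored quantities concrete, here is a small NumPy check. It is not the post's code: the array shapes and the factor of 3 are arbitrary, and the normalization is a simplified one without learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal((1000, 50)) * 3.0    # arbitrary pre-activations

h_out = z * (z > 0)                          # ReLU as written in the post: never negative
h_scaled = h_out / h_out.std(axis=0)         # dividing by the std forces unit variance per feature
h_norm = (h_out - h_out.mean(axis=0)) / h_out.std(axis=0)  # subtracting the mean allows negatives

print(h_out.min() >= 0)
print(np.allclose(h_scaled.std(axis=0), 1.0))
print(h_norm.min() < 0)
```

So the raw ReLU output can never go below 0, anything divided by its own std has unit variance by construction, and negative values only show up once the mean is subtracted.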
I actually think the idea of using leaky ReLUs is interesting, because it'll still provide a small gradient when x < 0, which may slightly alleviate the vanishing gradients issue.
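A quick sketch of what that gradient difference looks like, in NumPy. The slope `alpha=0.01` is just a commonly used value picked for illustration, and the function names are my own.

```python
import numpy as np

def relu_grad(z):
    """Derivative of ReLU: 1 where z > 0, otherwise 0."""
    return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: identity for z > 0, small slope alpha otherwise."""
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    """Derivative of leaky ReLU: 1 where z > 0, otherwise alpha."""
    return np.where(z > 0, 1.0, alpha)

z = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu_grad(z))        # zero gradient for the negative inputs
print(leaky_relu_grad(z))  # small but nonzero gradient for the negative inputs
```

With plain ReLU, a unit whose input is negative passes no gradient back at all. The leaky version always passes at least `alpha` times the upstream gradient.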
'Batch Normalization for Improved DNN Performance, My Ass' http://nyus.joshuawise.com/batchnorm.pdf