Batch Normalization for deep networks


	Batch Normalization for deep networks (rohanvarma.me)
	34 points by rvarma 3043 days ago

2 comments

nafizh 3043 days ago

This paper might be interesting regarding this post.

'Batch Normalization for Improved DNN Performance, My Ass' http://nyus.joshuawise.com/batchnorm.pdf


laythea 3043 days ago

That made me chuckle. So true. Thanks :)


chillee 3043 days ago

A couple of things: 1. How can activations be below 0 in a ReLU network? It seems like:

> h_out = z * (z > 0)

is the line that you're using to store activations. How can that be below 0 for any values of z?

EDIT: Whoops, I missed that he was storing the output of activations + batchnorm. I think it would make more sense to just store the output of the activations. The goal here is to show that batch norm provides good properties throughout the entire network. In this case, you're storing h_out/std(h_out), which trivially normalizes your layers to have the same variance.
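A minimal sketch of both points, assuming NumPy and made-up batch shapes (the names `relu`, `scale_by_std`, `z_small` and `z_large` are hypothetical, not from the post): the ReLU line can never produce negative values, and dividing by the per-feature standard deviation gives unit variance no matter how the raw activations were scaled.

```python
import numpy as np

def relu(z):
    # Same expression as the quoted line: negative entries are zeroed out.
    return z * (z > 0)

def scale_by_std(h, eps=1e-8):
    # Divide each feature (column) by its standard deviation over the batch.
    # eps is an assumption added here to avoid division by zero.
    return h / (np.std(h, axis=0) + eps)

rng = np.random.default_rng(0)
# Two hypothetical pre-activation batches with very different scales.
z_small = rng.normal(0.0, 0.01, size=(512, 16))
z_large = rng.normal(0.0, 100.0, size=(512, 16))

h_small, h_large = relu(z_small), relu(z_large)

# ReLU outputs are never negative.
print(h_small.min() >= 0, h_large.min() >= 0)

# The raw activation scales differ by orders of magnitude...
print(np.std(h_small, axis=0).mean(), np.std(h_large, axis=0).mean())

# ...but after dividing by std, every feature has (almost exactly) unit std.
print(np.std(scale_by_std(h_small), axis=0).mean(),
      np.std(scale_by_std(h_large), axis=0).mean())
```

So a histogram of the post-normalization values says little about how well-behaved the raw activations were.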

2. Using initialization distributions to motivate batch norm is a little bit misleading. Prior to batch norm, people had realized that initializing uniformly was a problem, and had switched to using Xavier initialization (specifically, a follow-up by Kaiming He found the initialization that works best for ReLU).

I do think that the intuitions for why both make sense are fairly similar. Although Xavier initialization fixes the variances at initialization, batch norm lets you maintain them throughout training.
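To make the initialization point concrete, here is a rough sketch (not the post's code) that pushes random data through a stack of equal-width ReLU layers and compares the final activation scale under Xavier-style and He-style normal initialization. The helper `final_activation_std`, the width, depth and batch size are all assumptions chosen for illustration.

```python
import numpy as np

def final_activation_std(weight_std, width=256, depth=20, batch=512, seed=0):
    """Std of the activations after `depth` ReLU layers with N(0, weight_std) weights."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(batch, width))   # random input data
    for _ in range(depth):
        W = rng.normal(0.0, weight_std, size=(width, width))
        h = np.maximum(h @ W, 0.0)        # linear layer followed by ReLU
    return h.std()

width = 256
xavier_std = np.sqrt(2.0 / (width + width))  # Xavier/Glorot: sqrt(2 / (fan_in + fan_out))
he_std = np.sqrt(2.0 / width)                # He: sqrt(2 / fan_in), derived for ReLU

xavier_result = final_activation_std(xavier_std, width=width)
he_result = final_activation_std(he_std, width=width)

print("Xavier:", xavier_result)
print("He:    ", he_result)
```

With equal layer widths, ReLU discards roughly half of the signal's second moment per layer under the Xavier scaling, so the activations shrink with depth, while the extra factor of 2 in He initialization compensates for that. Neither scheme does anything once training starts moving the weights, which is the part batch norm addresses.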

Another thing that's cool about your post is that my intuition was that Xavier initialization was not necessary if one was also using batch norm. It's cool to see that vindicated.


rvarma 3043 days ago

Thanks for your comment!

Regarding point 1, I stored both the activations before batchnorm and after since I needed them during the backwards pass. i.e. I stored h_out before and after these operations:

h_out = (h_out - np.mean(h_out, axis = 0)) / np.std(h_out, axis = 0)
h_out = gamma * h_out + beta
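For readers following along, the same two steps can be wrapped into a forward function that keeps both the pre- and post-normalization values around for the backward pass. This is only a sketch: the function name `batchnorm_forward`, the `cache` tuple and the `eps` term are assumptions and do not appear in the post.

```python
import numpy as np

def batchnorm_forward(h, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift."""
    mu = np.mean(h, axis=0)                 # per-feature batch mean
    var = np.var(h, axis=0)                 # per-feature batch variance
    h_hat = (h - mu) / np.sqrt(var + eps)   # eps (an assumption) guards against var == 0
    out = gamma * h_hat + beta              # learned scale and shift
    cache = (h, h_hat, mu, var, gamma, eps) # values a backward pass would need
    return out, cache

rng = np.random.default_rng(0)
h = np.maximum(rng.normal(size=(128, 8)), 0.0)  # hypothetical ReLU activations
gamma, beta = np.full(8, 2.0), np.full(8, 3.0)

out, cache = batchnorm_forward(h, gamma, beta)
print(out.mean(axis=0))  # close to beta
print(out.std(axis=0))   # close to gamma
```

Note that the output mean and standard deviation are set by `beta` and `gamma`, not by the incoming activations.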

Regarding point 2, I do realize now that fixing a bad init via Xavier/He initialization and using batch norm address slightly different problems. If I were to rewrite this post, I probably wouldn't talk about initialization at all, or would at least mention Xavier/He initialization.


chillee 3043 days ago

If you're interested in more experiments, I think it would be interesting to take a look at the batch norm variance/bias parameters after training a network, as well as the variances of networks trained with/without batch norm.

One thing that's been somewhat confusing me is whether Goodfellow's explanation aligns with the "internal covariate shift" explanation.

Goodfellow's interpretation that batch norm reduces second order interactions among the variables doesn't seem to be equivalent to "internal covariate shift". In the first place, I have trouble understanding why batch norm works well on ReLU. I've heard people say that you want things to be mean centered without too much variance, otherwise ReLU loses its nonlinearity. However, that would seem to imply that you want a bias of 0, and in that case, why even have that parameter?

Also, given this "internal covariate shift" explanation, why would batch norm after the activation generally work better?


haraldurt 3043 days ago

The first plot of "Activation 0" appears to actually be the random input, if it corresponds to hs[0] in the code. The rest of the activation plots seem to all be strictly non-negative. The other plots with negative values are of gradients, not activations.


rvarma 3043 days ago

Ah yeah, my bad. I should have shown the activations after the first layer instead, since "activation 0" is just the distribution of the random data I started with.


gwern 3043 days ago

Initialization?

> W = np.random.normal(0, np.sqrt(2/(h.shape[0] + layer_dim[i])), size = (layer_dim[i], h.shape[0]))

A N(0, sqrt(2/width)) would produce negative values.


chillee 3043 days ago

I was talking about the graphs here: https://i.imgur.com/M6P71aC.jpg

I missed that he's not storing activations for those graphs; he's storing activations + batch norm. See my edit.


singularity2001 3043 days ago

also there are 'leaking' ReLUs

f(x) = a*x if x<0 else b*x

usually 0 < a << b


rvarma 3043 days ago

I actually think the idea of using leaky ReLUs is interesting, because it'll still provide a small gradient when x < 0, which may slightly alleviate the vanishing gradients issue.
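A small sketch of that gradient argument, assuming the common convention of a fixed positive slope of 1 and a negative-side slope `a = 0.01` (both the value and the function names are illustrative, not from the post):

```python
import numpy as np

def relu_grad(x):
    # Plain ReLU: no gradient at all for non-positive inputs.
    return (x > 0).astype(float)

def leaky_relu(x, a=0.01):
    # Slope a on the negative side, slope 1 on the non-negative side.
    return np.where(x < 0, a * x, x)

def leaky_relu_grad(x, a=0.01):
    # The negative side still passes a small gradient of size a.
    return np.where(x < 0, a, 1.0)

x = np.array([-2.0, -0.5, 0.0, 3.0])
print(leaky_relu(x))       # negative inputs are scaled by a rather than zeroed
print(relu_grad(x))        # zero for every non-positive input
print(leaky_relu_grad(x))  # a for negative inputs, 1 otherwise
```

Units whose inputs are negative still receive a gradient scaled by `a`, so they can keep updating instead of going completely silent as they would with a plain ReLU.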


chillee 3043 days ago

I'm aware. He's using ReLU in this case though.
