
I've found that instance normalization usually gives better results so I prefer it over batch normalization.

With batch norm, each convolutional feature map has four scalars: mu (mean), sigma (stddev), alpha (scale) and beta (shift). Alpha and beta are learned. During training, mu and sigma are estimated from the statistics of the current batch; during testing they are constants, either estimated from the entire training set or computed as a running mean during training. At test time the batch norm operation is then alpha * (x - mu) / sigma + beta, which is a linear (strictly speaking, affine) operation in x since everything but x is constant; since it is linear it can be merged into a convolutional layer.
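To make the merging concrete, here is a minimal sketch in plain Python. It assumes a single-channel 1D convolution followed by test-time batch norm. The helper names (conv1d, batch_norm_test, fold_batch_norm) and the numbers are made up for illustration and do not come from any library. It also ignores the small epsilon that real implementations usually add under the square root.

```python
def conv1d(x, kernel, bias):
    """Valid 1D cross-correlation of a single-channel signal, plus a bias."""
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k)) + bias
            for i in range(len(x) - k + 1)]

def batch_norm_test(y, mu, sigma, alpha, beta):
    """Test-time batch norm with constant statistics."""
    return [alpha * (v - mu) / sigma + beta for v in y]

def fold_batch_norm(kernel, bias, mu, sigma, alpha, beta):
    """Return (kernel, bias) of one conv equivalent to conv followed by batch norm.

    alpha * (w.x + b - mu) / sigma + beta
        = (alpha / sigma) * w.x + (alpha / sigma) * (b - mu) + beta
    """
    scale = alpha / sigma
    return [scale * w for w in kernel], scale * (bias - mu) + beta

# Hypothetical values, chosen only for the demonstration.
x = [0.5, -1.0, 2.0, 3.5, -0.25, 1.0]
kernel, bias = [0.2, -0.4, 0.1], 0.3
mu, sigma, alpha, beta = 0.7, 1.5, 2.0, -0.1

# Two-step reference: convolution, then batch norm.
reference = batch_norm_test(conv1d(x, kernel, bias), mu, sigma, alpha, beta)

# Single-step version: one convolution with folded weights.
folded_kernel, folded_bias = fold_batch_norm(kernel, bias, mu, sigma, alpha, beta)
merged = conv1d(x, folded_kernel, folded_bias)

max_diff = max(abs(a - b) for a, b in zip(reference, merged))
```

The two outputs should agree up to floating-point rounding, so the batch norm layer costs nothing extra at test time once it is folded in.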

With instance norm, mu and sigma are estimated from data statistics during both training and testing; this means that the test-time forward pass is nonlinear, so it cannot be merged into a convolution (which is linear).
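A quick way to see the nonlinearity, again as a plain Python sketch under assumptions: instance_norm here is a hypothetical helper that normalizes one feature map (a flat list) with its own statistics, and eps is a small stabilizing constant of my choosing. A linear operation would scale its output by 10 when the input is scaled by 10; instance norm gives back essentially the same output instead.

```python
def instance_norm(x, alpha=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature map using statistics computed from that map itself."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    sigma = (var + eps) ** 0.5
    return [alpha * (v - mu) / sigma + beta for v in x]

x = [0.5, -1.0, 2.0, 3.5]
y = instance_norm(x)
y_scaled = instance_norm([10.0 * v for v in x])

# Distance from "same output as before" (what instance norm does).
diff_same = max(abs(a - b) for a, b in zip(y_scaled, y))

# Distance from "10x the output" (what a linear operation would do).
diff_linear = max(abs(a - 10.0 * b) for a, b in zip(y_scaled, y))
```

Here diff_same should be tiny and diff_linear large, because mu and sigma change with every input. That is why no fixed set of convolution weights can absorb the operation.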