by tsurba
453 days ago
Proper initialization is more important. Batch norm and similar normalization layers mainly matter for faster convergence: a simple shift in mean/std is normalized out, so the gradient cannot point in a direction that would only change those properties of the output distribution. That forces the model to focus on creating second- and higher-order nonlinearities.
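
A rough way to check the "normalized out" part, as a pure-Python sketch rather than any particular framework's batch norm: `normalize`, `loss_after_norm`, the tanh-based loss, and the sample batch are all made up for illustration. It standardizes a batch of scalars, applies an affine change to the inputs, and estimates the gradient of a loss with respect to a constant shift by finite differences.

```python
import math

def normalize(xs, eps=1e-5):
    """Standardize a batch of scalars (batch-norm style, no learned scale/shift)."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

def loss_after_norm(xs, shift):
    """Hypothetical loss computed on the normalized, shifted pre-activations."""
    ys = normalize([x + shift for x in xs])
    return sum(math.tanh(y) ** 2 for y in ys)

batch = [0.5, -1.2, 3.0, 0.1]

# Rescaling and shifting the inputs leaves the normalized output almost
# unchanged (exactly unchanged for the shift, up to the eps term for the scale).
transformed = normalize([3.0 * x + 10.0 for x in batch])
max_diff = max(abs(a - b) for a, b in zip(normalize(batch), transformed))

# Central finite difference of the loss with respect to a constant shift.
h = 1e-4
grad_shift = (loss_after_norm(batch, h) - loss_after_norm(batch, -h)) / (2 * h)

print(max_diff, grad_shift)
```

Both printed numbers should be close to zero: the loss has no gradient along the "just move the mean" direction, so an optimizer gets no signal there and has to change the shape of the distribution instead.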