Introduction
- Training a DNN is difficult: an overparametrized model is sensitive to many factors.
- The paper identifies one particular problem in training DNNs: Internal Covariate Shift (ICS). (A more recent paper argues that BN's success is not actually due to reducing ICS, but that is not the topic of discussion here.)
- The authors' precise definition of ICS:
"We define Internal Covariate Shift as the change in the distribution of network activations due to the change in network parameters during training."
Mechanism
- We compute the sample mean and sample variance along the batch dimension.
- We then normalize the batch using this sample mean and variance.
- Finally, using the learnable parameters γ and β, we scale and shift the normalized activations into a "desired" distribution that is easier for the model to learn (this is why γ and β are learnable: the network itself chooses that distribution).
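The three steps above can be sketched in NumPy (function name, shapes, and the `eps` value are my own choices, not from the paper; `eps` is the small constant commonly added for numerical stability):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize along the batch dimension, then apply a learned scale and shift.

    x: (N, D) activations; gamma, beta: (D,) learnable parameters.
    """
    mu = x.mean(axis=0)                     # sample mean per feature
    var = x.var(axis=0)                     # sample variance per feature
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalized activations
    return gamma * x_hat + beta             # learned scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 4))
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
# With gamma = 1 and beta = 0, each feature of `out` has roughly
# zero mean and unit variance, regardless of x's original statistics.
```

Note that γ and β can in principle undo the normalization entirely (γ = σ, β = μ), so the layer does not restrict what the network can represent.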
Effects
- Makes training converge faster by smoothing the loss surface. This smoothing is desirable because it lessens the harmful effects of problematic regions such as saddle points.
- It also makes training less sensitive to weight initialization.
- It even provides a slight regularization effect. (Believe it or not!)
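The regularization effect comes from batch noise: an example's normalized value depends on which other examples happen to share its mini-batch. A small demonstration (helper name and batch sizes are my own, with γ and β fixed to the identity for clarity):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalization only; gamma = 1, beta = 0.
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

rng = np.random.default_rng(1)
sample = rng.normal(size=(1, 3))                    # one fixed example
batch_a = np.vstack([sample, rng.normal(size=(15, 3))])
batch_b = np.vstack([sample, rng.normal(size=(15, 3))])

out_a = batch_norm(batch_a)[0]  # the fixed example, normalized with batch A's stats
out_b = batch_norm(batch_b)[0]  # the same example, normalized with batch B's stats
# out_a != out_b: the example's representation varies with its batch-mates,
# injecting noise during training much like a weak form of data augmentation.
```

This is also why BN behaves differently at inference time, where running averages of the statistics replace per-batch estimates.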