Review of Ioffe, Szegedy (2015) Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Normalization of training inputs has long been shown to increase the speed of learning in networks. The paper (Ioffe and Szegedy 2015) introduces a major improvement in deep learning, batch normalization (BN), which extends this idea by normalizing the activity within the network, across mini-batches (batches of training examples).

BN has been gaining a lot of traction in the academic literature, for example being used to improve segmentation (Hong, Noh, and Han 2015) and variational autoencoders (C. K. Sønderby et al. 2016).

The authors state that adding BN allows a version of the Inception image classification model to reach the same accuracy with 14 times fewer training steps, provided additional modifications are made to take advantage of BN. One of those modifications is removing the Dropout layers, because BN acts as a regularizer and, in their experiments, eliminates the need for Dropout. BN also allows the learning rate to be increased. It does all this while adding only a small number of parameters to be learned during training. A non-batch version of BN may even have a biological homolog: homeostatic plasticity.

BN separates the learning of the overall distribution of a neuron’s activity from the learning of the specific synaptic weights. For each “activation” \(x^{(k)}\), the mean and spread of the distribution of that activation are given by the learned parameters \(\beta^{(k)}\) and \(\gamma^{(k)}\), respectively.

Details

The original paper (Ioffe and Szegedy 2015) states that the normalization should be done per activation \(k\). In the early part of the paper the definition of an activation is left open. In their convolutional experiments, however, they normalize per feature map, computing the statistics jointly over the mini-batch and all spatial locations for a given feature.
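
To make the per-feature-map convention concrete, here is a minimal NumPy sketch (the NCHW layout and the tensor sizes are my own assumptions, not taken from the paper) that computes one mean and one variance per feature map by reducing over the batch and spatial dimensions:

```python
import numpy as np

# Hypothetical conv-layer activations in NCHW layout:
# N examples in the mini-batch, C feature maps, H x W spatial locations.
N, C, H, W = 32, 16, 28, 28
x = np.random.randn(N, C, H, W).astype(np.float32)

# Per-feature-map statistics: reduce over the batch and spatial axes (0, 2, 3),
# giving one mean and one variance per feature map (shape (C,)).
mu = x.mean(axis=(0, 2, 3))
var = x.var(axis=(0, 2, 3))

print(mu.shape, var.shape)  # (16,) (16,)
```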

Batch normalization step

For now, this section just regurgitates some of the basic information from the original paper.

Let \(x^{(k)}_i\) be a specific activation \(k\) for a given input \(i\). Batch normalization then normalizes this activation over all the inputs of the batch (mini-batch) \(i \in \{ 1 \ldots m \}\).

BN normalizes the data to a Gaussian whose mean and variance are learned during training. This is done by first normalizing the data to the standard Gaussian (\(\mu=0\) and \(\sigma=1\)), and then scaling by \(\gamma^{(k)}\) and shifting by \(\beta^{(k)}\).

Let \(\mathcal B = \left\{ x_{1 \ldots m}\right\}\) be a given batch.

Let \(\mu^{(k)}_{\mathcal B}\) and \(\big(\sigma^{(k)}_{\mathcal B}\big)^2\) be the mean and variance of a given activation, \(k\), across the batch of training inputs.

The normalization / whitening step is then:

\[ \hat x^{(k)}_i = \frac {x^{(k)}_i - \mu^{(k)}_{\mathcal B}} {\sqrt{\big(\sigma^{(k)}_{\mathcal B}\big)^2 + \epsilon}}. \]

And then there is the re-scaling and shifting step:

\[ y^{(k)}_i = \gamma^{(k)} \hat x^{(k)}_i + \beta^{(k)}, \]

where \(\gamma^{(k)}\) and \(\beta^{(k)}\), once again, are learned parameters.
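
Putting the two steps together, here is a minimal NumPy sketch of the batch-normalization forward pass for a single activation \(k\) (the function name and the value of \(\epsilon\) are my own choices for illustration):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x: shape (m,), the values of one activation k across the mini-batch.
    gamma, beta: the learned scale and shift for that activation."""
    mu = x.mean()                            # mini-batch mean, mu_B
    var = x.var()                            # mini-batch variance, sigma_B^2
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalization / whitening step
    y = gamma * x_hat + beta                 # re-scaling and shifting step
    return y

# Example: a mini-batch of m = 8 values for one activation.
x = np.random.randn(8)
y = batch_norm_forward(x, gamma=2.0, beta=0.5)
print(y.mean(), y.std())  # approximately beta and gamma, up to eps
```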

Discussion on Caffe’s Implementation

There is an interesting discussion on Caffe’s implementation in the pull request (PR):

https://github.com/BVLC/caffe/pull/3229

Modifying models for BN

Adding BN by itself can speed up training. However, in order to take full advantage of BN, additional changes need to be made. The authors’ suggestions include the following (a rough sketch follows the list):

  • Increase the learning rate (how much?).
  • Remove Dropout.
  • Reduce the L2 weight regularization by a factor of 5.
  • Accelerate the learning rate decay (the authors decay the rate 6 times faster).
  • Perform “within-shard” shuffling - although I don’t know what this is.
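
As a rough illustration of some of these changes (not the authors’ code; the PyTorch layer choices, learning rate, and weight-decay values are placeholders I picked), inserting BN before the nonlinearity, omitting Dropout, and adjusting the optimizer might look like this:

```python
import torch
import torch.nn as nn

# Hypothetical small conv block: BN is inserted after the convolution and
# before the nonlinearity, and the Dropout layer that a non-BN version
# might have used is simply omitted. Input images are assumed to be 3x32x32.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(16),     # learns gamma and beta per feature map
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 32 * 32, 10),
)

# Placeholder optimizer settings reflecting the spirit of the suggestions:
# a larger learning rate and weaker L2 regularization (weight decay).
optimizer = torch.optim.SGD(model.parameters(), lr=0.5,
                            momentum=0.9, weight_decay=1e-5)
```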

References

Hong, Seunghoon, Hyeonwoo Noh, and Bohyung Han. 2015. “Decoupled Deep Neural Network for Semi-Supervised Semantic Segmentation.” In Advances in Neural Information Processing Systems 28, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, 1495–1503. Curran Associates, Inc. http://papers.nips.cc/paper/5858-decoupled-deep-neural-network-for-semi-supervised-semantic-segmentation.pdf.

Ioffe, Sergey, and Christian Szegedy. 2015. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” ArXiv:1502.03167 [Cs], February. http://arxiv.org/abs/1502.03167.

Sønderby, Casper Kaae, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. 2016. “How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks.” ArXiv:1602.02282 [Cs, Stat], February. http://arxiv.org/abs/1602.02282.