This is the classic “ResNet” or Residual Network paper (He et al. 2015), which describes a method of making convolutional neural networks with a depth of up to 152 layers trainable. The residual networks described in this paper won the ILSVRC 2015 classification task and many other competitions.
Ideas such as vanishing gradients are useful for understanding the paper. One of the important take-aways from this paper, though, is that preventing gradients from vanishing (or exploding) does not necessarily make it practical to find optimal solutions to very deep models.
The first author, Kaiming He, provides the Caffe model for the network and other useful resources at:
https://github.com/KaimingHe/deep-residual-networks
His CVPR 2016 talk, which reviews this paper, is available on YouTube.
This paper provides three main “take-away” models (plus the ensemble models). In this review, I am just going to talk about what a “residual” network is and the idea behind it, and I will not talk about the complete models. The main models are:
ResNet-50
ResNet-101
ResNet-152
There is a follow-up paper from the same authors (He et al. 2016) that further optimizes the architecture and is able to train networks of over 1,000 layers.
What is a residual network?
I will first describe the implementation of residual networks, because it is remarkably simple, while the reasoning behind it can seem unintuitive at first. It is also simple to implement in any of the major deep learning frameworks. A basic feedforward convolutional network will often contain consecutive convolution layers, like subfigure A below:
Part of a convolutional neural network (A) without and (B) with shortcut connections.
Here, the “in” and “out” blocks are the input and output activations (“blobs”) of this segment of the network.
This could also be written as:
H(x) = relu(conv(x))
where \(conv\) performs a convolution with bias and then performs batch normalization (with scaling and bias).
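As a concrete sketch of this plain block (my own toy 1-D stand-in, not the paper's actual 2-D convolution + batch normalization implementation), it could look like:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv(x, w, b):
    # Toy 1-D "same" convolution with bias; stands in for the paper's
    # convolution followed by batch normalization (omitted for brevity).
    return np.convolve(x, w, mode="same") + b

rng = np.random.default_rng(0)
w, b = rng.normal(size=3), 0.1  # arbitrary example weights

def plain_block(x):
    # H(x) = relu(conv(x))
    return relu(conv(x, w, b))

x = rng.normal(size=8)
y = plain_block(x)  # same length as x, all entries non-negative
```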
A residual network, on the other hand, adds a shortcut connection, as shown in subfigure B above. The addition operation is an element-wise addition. This could be written as:
F(x) = relu(conv(x) + x)
Note that this modification does not increase the number of parameters. In fact, a trained residual network can be converted into an equivalent plain convolutional network, and vice versa.
So, why does this help?
Motivation
This part is a bit crazy. When increasing the depth of traditional networks, there is an initial increase in accuracy, then it plateaus, and then, if the depth is further increased, the accuracy rapidly falls off. I know what you're thinking: the model is being over-fitted. But this is not the case. The training error also increases with increasing depth! And, of course, the validation error increases along with it.
As a thought experiment, instead of adding the extra convolutional layers (the ones that increase the training error), add identity functions. This network will be just as easy to train as the original network (since, well, it is practically the same network). So, for classical networks that are too deep, you could improve both the training and validation error by replacing some of the convolution layers with identity functions.
Instead of replacing the convolution layers with identity functions, they add identity functions as a shortcut connection between multiple convolution layers. This is how I think about it: adding the shortcut connections allows the networks to first learn the optimal solution where the “extra” layers are treated as identity functions. Once it finds this optimum, it is able to use the extra layers to improve on this solution.
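This intuition can be checked in a toy sketch (again a 1-D stand-in of my own, not the paper's code): if the convolution weights of a residual block are zero, the block reduces to `relu(x)`, which is an identity map on the non-negative activations coming out of a preceding ReLU. The extra layer can therefore start from, or fall back to, the identity:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv(x, w, b):
    # Toy 1-D "same" convolution with bias, standing in for conv + batch norm.
    return np.convolve(x, w, mode="same") + b

def residual_block(x, w, b):
    return relu(conv(x, w, b) + x)  # F(x) = relu(conv(x) + x)

x = relu(np.random.default_rng(1).normal(size=8))  # activations are >= 0
w, b = np.zeros(3), 0.0                            # "do nothing" weights

# With zero weights, the shortcut passes the input through unchanged.
assert np.allclose(residual_block(x, w, b), x)
```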
You might think that the shortcut connections provide a path where the gradient does not diminish, but this may be wrong. The authors mention that batch normalization already prevents the gradients from vanishing. They suggest instead that the plain networks (without shortcuts) may have exponentially low convergence rates.
I’m not intending to replace the original paper, so you should read it for more details. Hopefully this gives you a good quick overview and the motivation to dig in more!
Beyond Classical ResNet
In addition to the follow-up paper (He et al. 2016), there are some additional papers that go beyond this work. Chu, Yang, and Tadinada (2017) use visualization techniques to explore the internals of residual networks. Zagoruyko and Komodakis (2016) explore making residual networks “wider” (increasing the number of channels) and shallower. Huang et al. (2016) take the idea of residual connections to the extreme: each layer is connected to all of the later layers within a “dense block” (all layers have direct feedforward connections). There is also a wide and shallow variant that follows the strategy of Zagoruyko and Komodakis (2016). In a separate paper, Huang et al. (2016) also introduced stochastic-depth residual networks.
References
Chu, Brian, Daylen Yang, and Ravi Tadinada. 2017. “Visualizing Residual Networks.” ArXiv:1701.02362 [Cs], January. http://arxiv.org/abs/1701.02362.
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. “Deep Residual Learning for Image Recognition.” ArXiv:1512.03385 [Cs], December. http://arxiv.org/abs/1512.03385.
———. 2016. “Identity Mappings in Deep Residual Networks.” ArXiv:1603.05027 [Cs], March. http://arxiv.org/abs/1603.05027.
Huang, Gao, Zhuang Liu, Kilian Q. Weinberger, and Laurens van der Maaten. 2016. “Densely Connected Convolutional Networks.” ArXiv:1608.06993 [Cs], August. http://arxiv.org/abs/1608.06993.
Huang, Gao, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. 2016. “Deep Networks with Stochastic Depth.” ArXiv:1603.09382 [Cs], March. http://arxiv.org/abs/1603.09382.
Zagoruyko, Sergey, and Nikos Komodakis. 2016. “Wide Residual Networks.” ArXiv:1605.07146 [Cs], May. http://arxiv.org/abs/1605.07146.