Backpropagation with shared weights in convolutional neural networks

The success of deep convolutional neural networks would not be possible without weight sharing: the same weights being applied to different neuronal connections. However, this property also makes them more complicated. This post aims to give an intuition for how backpropagation works with weight sharing. For a more well-rounded introduction to backpropagation in convolutional neural networks, see Andrew Gibiansky’s blog post.

Backpropagation is used to calculate how the error in a neural network changes with respect to changes in a weight \(w\) in that neural network. In other words, it calculates:

\[\frac{\partial E}{\partial w}, \]

where \(E\) is the error and \(w\) is a weight.

For traditional feed-forward neural networks, each connection between two neurons has its own weight, and the backpropagation calculation is generally straightforward using the chain rule. For example, if you know how the error changes with respect to the node \(y_i\) (i.e. \(\frac{\partial E}{\partial y_i}\)), then calculating the contribution of a pre-synaptic weight of that node is simply:

\[\frac{\partial E}{\partial w}=\frac{\partial E}{\partial y_i}\frac{\partial y_i}{\partial w}. \]
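As a rough illustration of this single-weight case, the sketch below applies the chain rule to one node and compares the result against a finite-difference estimate. The activation (`tanh`), the squared-error function, and all function names are assumptions made for this example; none of them come from the post itself.

```python
import math

def forward(w, x):
    """Single node: y = h(w * x), with h = tanh as an assumed activation."""
    return math.tanh(w * x)

def error(y, target):
    """Squared error, chosen only for illustration."""
    return 0.5 * (y - target) ** 2

def grad_w(w, x, target):
    """dE/dw = dE/dy * dy/dw, following the chain rule above."""
    y = forward(w, x)
    dE_dy = y - target            # derivative of the squared error
    dy_dw = (1.0 - y ** 2) * x    # tanh'(z) = 1 - tanh(z)^2, times dz/dw = x
    return dE_dy * dy_dw

def numeric_grad_w(w, x, target, eps=1e-6):
    """Central finite difference, used as an independent check."""
    e_plus = error(forward(w + eps, x), target)
    e_minus = error(forward(w - eps, x), target)
    return (e_plus - e_minus) / (2 * eps)

if __name__ == "__main__":
    print(grad_w(0.5, 2.0, 0.3), numeric_grad_w(0.5, 2.0, 0.3))
```

The two printed values should agree to several decimal places, which is a quick way to sanity-check a hand-derived gradient.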

This becomes more complicated in convolutional neural networks because the same weight \(w\) is used for multiple nodes (often most or all nodes in the same layer).

Handling shared weights

In classical convolutional neural networks, shared weights are handled by summing the contribution from each place the weight appears in the backpropagation derivation, rather than, for example, averaging over the occurrences. So, if \(y^l\) is the layer “post-synaptic” to the weight \(w\) and we have calculated the effect of that layer on the error (\(\frac{\partial E}{\partial y^l}\)), then the gradient for the weight is:

\[\frac{\partial E}{\partial w}=\sum_i\frac{\partial E}{\partial y^l_i} \frac{\partial y^l_i}{\partial w}, \]

where \(i\) specifies the node within layer \(l\).
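A minimal sketch of this summation, assuming a layer where every node applies the same weight to its own input (`y_i = tanh(w * x_i)`) and a squared-error loss. Both choices and all names are hypothetical and serve only to make the sum concrete.

```python
import math

def layer_forward(w, xs):
    """Every node i uses the same shared weight w: y_i = tanh(w * x_i)."""
    return [math.tanh(w * x) for x in xs]

def layer_error(ys, targets):
    """Squared error summed over all nodes (an assumed loss)."""
    return 0.5 * sum((y - t) ** 2 for y, t in zip(ys, targets))

def shared_weight_grad(w, xs, targets):
    """dE/dw = sum_i dE/dy_i * dy_i/dw, one term per use of w."""
    ys = layer_forward(w, xs)
    per_node = [
        (y - t) * (1.0 - y ** 2) * x   # dE/dy_i * dy_i/dw for node i
        for y, t, x in zip(ys, targets, xs)
    ]
    return sum(per_node)

def numeric_shared_grad(w, xs, targets, eps=1e-6):
    """Central finite difference on the full error, as a check."""
    e_plus = layer_error(layer_forward(w + eps, xs), targets)
    e_minus = layer_error(layer_forward(w - eps, xs), targets)
    return (e_plus - e_minus) / (2 * eps)

if __name__ == "__main__":
    xs = [0.5, -1.0, 2.0]
    targets = [0.1, 0.0, 0.4]
    print(shared_weight_grad(0.3, xs, targets),
          numeric_shared_grad(0.3, xs, targets))
```

Averaging the per-node terms instead of summing them would scale the result by the number of nodes and would no longer match the finite-difference estimate.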

So why is summation the correct operation? In essence, it is because when the paths from a weight (applied at different locations) merge, they do so with summation. For example, convolution sums the paths (in the dot product). Fully connected layers likewise sum their inputs, and max pooling passes the gradient along only the selected path, which is a sum in which all other terms are zero.

Simple example

Let’s take a very simple convolutional network.

Let layer \(y^0\) be a 1D input layer and \([w_0, 0, 0]\) a kernel that is applied to it, so that each output depends on a single input. For simplicity, let’s only have a single kernel. Then:

\[ x^1_{i}=w_0 y^0_{i} \]

An activation function is then applied to this result: \(y^1_i=h(x^1_{i})\).

For the next convolutional layer, let’s say that the kernel \([w_1,w_2,w_3]\) is applied. Then:

\[ \begin{aligned} x^2_{i}&=\sum_{a=1}^3 w_a y^1_{i+a-1} \\ &= w_1 y^1_i + w_2 y^1_{i+1} + w_3 y^1_{i+2} \\ &= w_1 h\left(w_0 y^0_{i}\right) + w_2 h\left(w_0 y^0_{i+1}\right) + w_3 h\left(w_0 y^0_{i+2}\right), \end{aligned} \] and \[ y^2_{i} = h(x^2_{i}). \]
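The forward pass of this small network can be sketched as follows. The sketch assumes a 1D input stored as a Python list, `tanh` as the activation \(h\), and no padding, so the second layer produces two fewer outputs than it has inputs. Python indices start at 0, so `y1[i]`, `y1[i + 1]`, `y1[i + 2]` play the role of \(y^1_i, y^1_{i+1}, y^1_{i+2}\). The function name and these choices are illustrative only.

```python
import math

def forward_pass(y0, w0, kernel, h=math.tanh):
    """Two-layer toy network from the equations above.

    Layer 1: x1[i] = w0 * y0[i], then y1[i] = h(x1[i]).
    Layer 2: x2[i] = w1*y1[i] + w2*y1[i+1] + w3*y1[i+2], then y2[i] = h(x2[i]).
    """
    x1 = [w0 * v for v in y0]
    y1 = [h(v) for v in x1]
    w1, w2, w3 = kernel
    # No padding (an assumption): only positions where the kernel fits.
    x2 = [
        w1 * y1[i] + w2 * y1[i + 1] + w3 * y1[i + 2]
        for i in range(len(y1) - 2)
    ]
    y2 = [h(v) for v in x2]
    return x1, y1, x2, y2

if __name__ == "__main__":
    x1, y1, x2, y2 = forward_pass([1.0, 2.0, 3.0, 4.0], 0.5, (0.1, 0.2, 0.3))
    print(x2, y2)
```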

We are interested in \(\frac{\partial E}{\partial w_0}\). Let’s say that the error is only affected by the \(j\)th node of the output: \(y^2_{j}\). Then:

\[\frac{\partial E}{\partial w_0} = \frac{\partial E}{\partial y^2_{j}}\frac{\partial y^2_{j}}{\partial x^2_j}\frac{\partial x^2_{j}}{\partial w_0} \]

Assuming that we already have \(\frac{\partial E}{\partial y^2_{j}}\) and \(\frac{\partial y^2_{j}}{\partial x^2_j}\), we only need to solve for \(\frac{\partial x^2_{j}}{\partial w_0}\):

\[ \begin{aligned} \frac{\partial x^2_{j}}{\partial w_0}&=\frac{\partial}{\partial w_0} \left(\sum_{a=1}^3 w_a y^1_{j+a-1}\right)\\ &= \sum_{a=1}^3 w_a \frac{\partial}{\partial w_0} \left( y^1_{j+a-1}\right)\\ &= \sum_{a=1}^3 w_a \frac{\partial}{\partial w_0} \left( h\left(w_0 y^0_{j+a-1}\right)\right)\\ &= w_1 \frac{\partial}{\partial w_0} h\left(w_0 y^0_{j}\right) + w_2 \frac{\partial}{\partial w_0} h\left(w_0 y^0_{j+1}\right) + w_3 \frac{\partial}{\partial w_0} h\left(w_0 y^0_{j+2}\right). \\ \end{aligned} \]
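Applying the chain rule once more to each term, and assuming \(h\) is differentiable with derivative \(h'\) (notation not used elsewhere in this post), the last line can be taken one step further:

\[ \begin{aligned} \frac{\partial x^2_{j}}{\partial w_0}&= w_1 h'\left(w_0 y^0_{j}\right) y^0_{j} + w_2 h'\left(w_0 y^0_{j+1}\right) y^0_{j+1} + w_3 h'\left(w_0 y^0_{j+2}\right) y^0_{j+2}\\ &= \sum_{a=1}^3 w_a h'\left(w_0 y^0_{j+a-1}\right) y^0_{j+a-1}. \end{aligned} \]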

Notice that each occurrence of \(w_0\) contributes its own term to the sum, which is why backpropagation sums over the shared weights in convolutional networks.
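The derivation above can be checked numerically. The sketch below assumes \(h = \tanh\), so that \(\frac{\partial}{\partial w_0} h(w_0 y) = (1 - \tanh^2(w_0 y))\,y\), and uses 0-based indexing, so `y0[j + a]` for `a = 0, 1, 2` corresponds to \(y^0_{j+a-1}\) for \(a = 1, 2, 3\). The function names are hypothetical.

```python
import math

def x2_j(w0, kernel, y0, j):
    """x^2_j = sum_a w_a * h(w0 * y0[j + a]) with h = tanh (assumed)."""
    return sum(w_a * math.tanh(w0 * y0[j + a]) for a, w_a in enumerate(kernel))

def dx2_j_dw0(w0, kernel, y0, j):
    """Analytic derivative: one term per place where w0 is applied."""
    total = 0.0
    for a, w_a in enumerate(kernel):
        y_in = y0[j + a]
        t = math.tanh(w0 * y_in)
        total += w_a * (1.0 - t ** 2) * y_in   # w_a * h'(w0 * y) * y
    return total

def numeric_dx2_j_dw0(w0, kernel, y0, j, eps=1e-6):
    """Central finite difference in w0, used as an independent check."""
    plus = x2_j(w0 + eps, kernel, y0, j)
    minus = x2_j(w0 - eps, kernel, y0, j)
    return (plus - minus) / (2 * eps)

if __name__ == "__main__":
    y0 = [1.0, 2.0, 3.0, 4.0]
    kernel = (0.1, 0.2, 0.3)
    print(dx2_j_dw0(0.5, kernel, y0, 0), numeric_dx2_j_dw0(0.5, kernel, y0, 0))
```

If the three terms were not summed, for instance if only the first were kept, the analytic value would no longer agree with the finite-difference estimate.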
