Blog Archive


Review of He et al. 2015 *Deep Residual Learning for Image Recognition*
This is the classic “ResNet” or Residual Network paper (He et al. 2015), which describes a method of making convolutional neural networks with a depth of up to 152 layers trainable. The residual networks described in this paper won the ILSVRC 2015 classification task and many other competitions. Ideas such as vanishing gradients are useful for understanding the paper. One of the important take-aways from this paper, though, is that preventing gradients from vanishing (or exploding) does not necessarily make it practical to find optimal solutions to very deep models.

The first author, Kaiming He, provides the Caffe model for the network and other useful resources online. His CVPR 2016 talk, which reviews this paper, is available on YouTube.

This paper provides three main “take-away” models (plus the ensemble models):

- ResNet-50
- ResNet-101
- ResNet-152

In this review, I am just going to talk about what a “residual” network is and the idea behind it, and I will not talk about the complete models.

There is a follow-up paper from the same authors (He et al. 2016) that goes further in optimizing the architecture and is able to train networks of up to 1,000 layers.

What is a residual network?

I will first describe the implementation of residual networks, because it is incredibly easy, while the reasoning behind it seems unintuitive at first. It is also simple to implement in any of the major deep learning frameworks.

A basic feedforward convolutional network will often contain consecutive convolution layers, like subfigure A below:

Figure: Part of a convolutional neural network (A) without and (B) with shortcut connections.

Here the “in” and “out” blocks are the input and output activations (“blobs”) of this segment of the network. This could also be written as:

\[ H(x) = \mathrm{relu}(\mathrm{conv}(x)), \]

where \(\mathrm{conv}\) performs a convolution with bias and then performs batch normalization (with scaling and bias).
A residual network, on the other hand, adds a shortcut connection, as shown in subfigure B above. The addition operation is an element-wise addition. This could be written as:

\[ F(x) = \mathrm{relu}(\mathrm{conv}(x) + x). \]

Note that this modification does not increase the number of parameters. In fact, the trained residual network can be converted to an identical plain convolutional network and vice versa. So, why does this help?

Motivation

This part is a bit crazy. When increasing the depth of traditional networks, there is an initial increase in accuracy, then it plateaus, and then, if the depth is further increased, the accuracy rapidly falls off. I know what you're thinking: that the model is being over-fitted. But this is not the case. The training error also increases with increasing depth! Of course, the validation error also increases.

As a thought experiment, instead of adding the extra convolutional layers (the ones that increase the training error), add identity functions. This network will be just as easy to train as the original network (since, well, it is practically the same network). So for classical neural networks that are too deep, you can improve the training and validation error by replacing some of the convolution layers with identity functions.

Instead of replacing the convolution layers with identity functions, the authors add identity functions as shortcut connections between multiple convolution layers. This is how I think about it: adding the shortcut connections allows the networks to first learn the optimal solution where the “extra” layers are treated as identity functions. Once it finds this optimum, it is able to use the extra layers to improve on this solution.

You might think that the shortcut connections provide a path where the gradient doesn’t diminish, although this may be wrong. The authors mention that Batch Normalization prevents the gradients from vanishing.
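The two block types and the identity thought experiment above can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions: it uses a 1D, single-channel convolution with zero padding, leaves out the bias and batch normalization that the review folds into conv, and the function names (`conv1d_same`, `plain_block`, `residual_block`) are made up for this sketch rather than taken from the paper or from any framework.

```python
def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def conv1d_same(x, kernel):
    """Toy 1D convolution with zero padding; the output has the same length as x.

    Assumption: no bias and no batch normalization, unlike the conv in the review.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(k * padded[i + t] for t, k in enumerate(kernel))
        for i in range(len(x))
    ]


def plain_block(x, kernel):
    """H(x) = relu(conv(x))"""
    return relu(conv1d_same(x, kernel))


def residual_block(x, kernel):
    """F(x) = relu(conv(x) + x), with an element-wise addition."""
    conv_out = conv1d_same(x, kernel)
    return relu([c + a for c, a in zip(conv_out, x)])


x = [1.0, 2.0, 0.5, 3.0]          # non-negative input activations
zero_kernel = [0.0, 0.0, 0.0]

# With all-zero weights the plain block wipes out the signal,
# while the residual block passes the (non-negative) input through unchanged.
print(plain_block(x, zero_kernel))
print(residual_block(x, zero_kernel))

# Without batch normalization, the shortcut can be folded into the centre tap
# of the kernel, giving a plain block that computes the same function.
kernel = [0.5, -0.25, 0.25]
merged = [kernel[0], kernel[1] + 1.0, kernel[2]]
same = all(
    abs(a - b) < 1e-12
    for a, b in zip(residual_block(x, kernel), plain_block(x, merged))
)
print(same)
```

The zero-kernel case shows why the “extra” layers are easy to treat as identity functions in a residual network: the weights only have to go to zero, whereas a plain block would have to learn an identity kernel. The identity holds here only because the input is non-negative, so the final relu leaves it untouched.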
The authors suggest that plain networks (without shortcuts) may have exponentially low convergence rates.

I’m not intending to replace the original paper, so you should read it for more details. Hopefully this gives you a good quick overview and motivation to dig in more!

Beyond Classical ResNet

In addition to the follow-up paper (He et al. 2016), there are some additional papers that go beyond this work. Chu, Yang, and Tadinada (2017) use visualization techniques to explore the internals of residual networks. Zagoruyko and Komodakis (2016) explore making residual networks “wider” (increasing the number of channels) and shallower. Huang, Liu, et al. (2016) take the idea of residual connections to the extreme. The basic idea is that each layer is connected to all of the later layers within a “dense block” (all layers have direct feedforward connections). There is also a wide and shallow variant that follows the strategy of (Zagoruyko and Komodakis 2016). Huang, Sun, et al. (2016) also introduced stochastic depth residual networks.

References

Chu, Brian, Daylen Yang, and Ravi Tadinada. 2017. “Visualizing Residual Networks.” ArXiv:1701.02362 [Cs], January.

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. “Deep Residual Learning for Image Recognition.” ArXiv:1512.03385 [Cs], December.

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. “Identity Mappings in Deep Residual Networks.” ArXiv:1603.05027 [Cs], March.

Huang, Gao, Zhuang Liu, Kilian Q. Weinberger, and Laurens van der Maaten. 2016. “Densely Connected Convolutional Networks.” ArXiv:1608.06993 [Cs], August.

Huang, Gao, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. 2016. “Deep Networks with Stochastic Depth.” ArXiv:1603.09382 [Cs], March.

Zagoruyko, Sergey, and Nikos Komodakis. 2016. “Wide Residual Networks.” ArXiv:1605.07146 [Cs], May.

Fast command-line navigation, automatic bookmarking, and referencing using fasd
One of my favorite recent productivity discoveries is fasd. It lets you easily jump to directories that you visit frequently. For example, I have a directory named “caffe-help-barebones” as well as “caffe-help-git” and other “caffe-help-*” directories. Regardless of the current directory, I can jump to it by using z caffe-help-barebones, or by just referencing a unique part of the directory name: z barebones.

Installation

Installing is very easy, which is nice since I wanted to install it quickly on all of my Linux systems. Additional installation methods are listed on the fasd website. I prefer to avoid PPAs or installing these types of applications system-wide, so I use the following to install:

    # build tools, plus pandoc for generating the man page
    sudo apt-get install build-essential pandoc
    mkdir -p ~/.local/install
    mkdir -p ~/.local/bin
    cd ~/.local/install
    git clone <fasd-repository-url> fasd-git
    cd fasd-git
    make all
    PREFIX=~/.local make install

Your bashrc should run eval "$(fasd --init auto)". If you want to include it in all of your bashrc’s, even on systems that might not have fasd installed, you can run the following:

    echo -e '\nif command -v fasd >/dev/null 2>&1; then\n  eval "$(fasd --init auto)"\nfi' >> ~/.bashrc

This appends the following to your .bashrc (the redirect keeps command -v from printing the path to fasd every time a shell starts):

    if command -v fasd >/dev/null 2>&1; then
      eval "$(fasd --init auto)"
    fi

Review of Ioffe & Szegedy 2015 *Batch normalization*
Normalization of training inputs has long been shown to increase the speed of learning in networks. The paper (Ioffe and Szegedy 2015) introduces a major improvement in deep learning, batch normalization (BN), which extends this idea by normalizing the activity within the network, across mini-batches (batches of training examples).

BN has been gaining a lot of traction in the academic literature, for example being used to improve segmentation (Hong, Noh, and Han 2015) and variational autoencoders (C. K. Sønderby et al. 2016).

The authors state that adding BN allows a version of the Inception image classification model to learn with 14 times fewer training steps, when additional modifications are made in order to take advantage of BN. One of the modifications is removing the Dropout layers, because BN acts as a regularizer and can reduce or even eliminate the need for Dropout. It also allows for the learning rate to be increased. It does all this while adding only a small number of parameters to be learned during training.

A non-batch version of BN may even have a biological homolog: homeostatic plasticity.

BN separates the learning of the overall distribution of the activity of the neuron from the learning of the specific synaptic weights. For each “activation” \(x^{(k)}\), the mean and spread of the distribution of the activation are given by the learned parameters \(\beta^{(k)}\) and \(\gamma^{(k)}\), respectively.

Details

The original paper (Ioffe and Szegedy 2015) states that the normalization should be done per activation \(k\). In the initial part of the paper the definition of activation is left open. In their experiments, however, they do the normalization across each feature map (across batches and locations, for a specific feature).

Batch normalization step

For now, this section just regurgitates some of the basic information from the original paper.

Let \(x^{(k)}_i\) be a specific activation \(k\) for a given input \(i\).
Batch normalization then normalizes this activation over all the inputs of the batch (mini-batch) of inputs \(i \in \{ 1 \ldots m \}\). BN standardizes the activation so that its mean and standard deviation are set by parameters learned during training. This is done by first normalizing the data to zero mean and unit variance (\(\mu=0\) and \(\sigma=1\)), and then scaling by \(\gamma^{(k)}\) and adding the offset \(\beta^{(k)}\).

Let \(\mathcal B = \left\{ x_{1 \ldots m}\right\}\) be a given batch. Let \(\mu^{(k)}_{\mathcal B}\) and \(\left(\sigma^{(k)}_{\mathcal B}\right)^2\) be the mean and variance of a given activation, \(k\), across the batch of training inputs. The normalization / whitening step is then:

\[ \hat x^{(k)}_i = \frac {x^{(k)}_i - \mu^{(k)}_{\mathcal B}} {\sqrt{\left(\sigma^{(k)}_{\mathcal B}\right)^2 + \epsilon}}, \]

where \(\epsilon\) is a small constant added for numerical stability. And then there is the re-scaling and shifting step:

\[ y^{(k)}_i = \gamma^{(k)} \hat x^{(k)}_i + \beta^{(k)}, \]

where \(\gamma^{(k)}\) and \(\beta^{(k)}\), once again, are learned parameters.

Discussion on Caffe’s Implementation

There is an interesting discussion on Caffe’s implementation in the corresponding pull request (PR).

Modifying models for BN

Adding BN by itself can speed up training. However, in order to take full advantage of BN, additional steps need to be taken. The authors’ suggestions include:

- Increase the learning rate (how much?).
- Remove Dropout.
- Reduce the L2 weight regularization by a factor of 5.
- Accelerate the learning rate decay by a factor of 6.
- Perform “within-shard” shuffling - although I don’t know what this is.

References

Hong, Seunghoon, Hyeonwoo Noh, and Bohyung Han. 2015. “Decoupled Deep Neural Network for Semi-Supervised Semantic Segmentation.” In Advances in Neural Information Processing Systems 28, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, 1495–1503. Curran Associates, Inc.

Ioffe, Sergey, and Christian Szegedy. 2015. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” ArXiv:1502.03167 [Cs], February.
Sønderby, Casper Kaae, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. 2016. “How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks.” ArXiv:1602.02282 [Cs, Stat], February.
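As a rough illustration of the batch normalization step reviewed in this post, the sketch below applies the normalization and the re-scaling/shifting to a single activation over one mini-batch. It is a minimal training-time sketch, not the Caffe or paper implementation: the function name `batch_norm`, the example batch, and the default `eps` value are assumptions chosen for illustration, and the running averages used at inference time are left out.

```python
import math


def batch_norm(xs, gamma, beta, eps=1e-5):
    """Batch-normalize one activation k over a mini-batch xs = [x_1, ..., x_m].

    gamma and beta stand in for the learned scale and offset; here they are
    simply passed in as fixed numbers.
    """
    m = len(xs)
    mu = sum(xs) / m                               # batch mean
    var = sum((x - mu) ** 2 for x in xs) / m       # batch variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]   # normalization step
    return [gamma * xh + beta for xh in x_hat]     # re-scaling and shifting step


batch = [1.0, 2.0, 3.0, 4.0]
ys = batch_norm(batch, gamma=2.0, beta=3.0)

# The output mean and standard deviation end up close to beta and gamma.
y_mean = sum(ys) / len(ys)
y_std = math.sqrt(sum((y - y_mean) ** 2 for y in ys) / len(ys))
print(y_mean, y_std)
```

The output standard deviation is slightly smaller than gamma because of the eps term in the denominator.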

Review of Ichida, Schwabe, Bressloff, & Angelucci (2007) 'Response Facilitation From the “Suppressive” Receptive Field Surround of Macaque V1 Neurons'
Overview

The extraclassical surround (ECS) generally suppresses the firing rate of visual neurons in the primary visual cortex (V1), especially when the surround stimulus has the same orientation (iso-oriented). However, it has been shown that the ECS can actually enhance the firing rate when the stimulus has a low contrast. In (Ichida et al. 2007), the authors test a prediction from a model they have published (Schwabe et al. 2006): that the far ECS, and not just the immediate ECS, can enhance the response. They find that the far ECS can indeed enhance the response, but only when the immediate ECS does not contain the iso-oriented stimulus.

Methods

The authors define the classical receptive field (CRF, called the minimum response field in the paper) as the region that can be driven using a small high contrast 0.1° grating. The size of the CRF depends on the contrast of the stimulus. It is larger for low contrast stimuli than for high contrast stimuli. The authors define the immediate ECS as the area beyond the high contrast CRF where a low contrast stimulus would increase the response.

Instead of using the CRF, they used what they call the high-contrast and low-contrast summation RF. They first found the CRF (minimum response field) by using a 0.1° grating. They then used this to center a high-contrast grating patch. They then varied the size of the patch and found the size that optimally stimulated the cell. They called this the high contrast summation RF (SRF_high) or simply the RF center. They used the same protocol with a low contrast grating to find the low-contrast summation RF (SRF_low). They called the region between SRF_high and SRF_low the near surround.

The far surround’s outer diameter was set to 14°. The inner diameter varied but was no smaller than the SRF_low.

References

Ichida, Jennifer M., Lars Schwabe, Paul C. Bressloff, and Alessandra Angelucci. 2007. “Response Facilitation From the ‘Suppressive’ Receptive Field Surround of Macaque V1 Neurons.” Journal of Neurophysiology 98 (4): 2168–81. doi:10.1152/jn.00298.2007.

Schwabe, Lars, Klaus Obermayer, Alessandra Angelucci, and Paul C. Bressloff. 2006. “The Role of Feedback in Shaping the Extra-Classical Receptive Field of Cortical Neurons: A Recurrent Network Model.” J. Neurosci. 26 (36): 9117–29. doi:10.1523/JNEUROSCI.1253-06.2006.

Review of Zoccolan et al. 2005 'Multiple Object Response Normalization in Monkey Inferotemporal Cortex'
Even though the neurons in inferotemporal cortex (IT) have very large receptive fields, it is tempting to believe that the neurons would be able to distinguish objects presented within their receptive fields. For example, if a neuron responds to objects A and B at different rates, perhaps the neuron should give the maximum of these two rates when both stimuli are presented within its receptive field. The study (Zoccolan, Cox, and DiCarlo 2005) shows that this is not the case: when presented with two objects, most IT neurons’ responses are the mean of the firing rates when the objects are presented separately - at least for short presentation times and when the objects are not attended.

There is a lot more to this paper than what I will cover in this review / note. I hope to add more in the future, but the most important points are straightforward.

They use simple artificial shapes on a plain background. The first results show that, in the population, the cells’ responses to the presentation of multiple objects cluster around the mean of their responses when the objects are presented separately. There is a slight tendency to fire at a rate slightly higher than the average, but the lack of scatter is rather amazing. There is a line in Figures 1C and 1D for the sum of the responses, and very few of the cells fall on or above this line.

They then show that the responses to the combined object displays are much more like the mean of the responses to individual object displays than a max model, at least in the mean cell population. There is a lot of spread in these results, leaving open the possibility that some neurons give a response that is the maximum of the responses to the two objects separately (or even a higher response).

Zoccolan, Davide, David D. Cox, and James J. DiCarlo. 2005. “Multiple Object Response Normalization in Monkey Inferotemporal Cortex.” J. Neurosci. 25 (36): 8150–64. doi:10.1523/JNEUROSCI.2058-05.2005.

Review of Liu, Hashemi-Nezhad, & Lyon (2015) 'Contrast invariance of orientation tuning in cat primary visual cortex neurons depends on stimulus size'
Overview

There are two main findings from (Liu, Hashemi-Nezhad, and Lyon 2015) in the primary visual cortex (V1) of anesthetized cats. First, contrast-invariant orientation tuning depends on having a stimulus that extends beyond the CRF. If the stimulus is optimized for the CRF, then the tuning width decreases with lower contrast (illustrated in Figure 3 of the paper). The orientation tuning profile is invariant when the stimulus extends into the surround, but not when it only covers the CRF.

The second main finding (illustrated in Figure 4 of the paper) is that contrast invariance appears with the large stimulus because the tuning width for the high contrast stimulus decreases when the surround stimulus is added to the CRF stimulus. The tuning width for the low contrast conditions on average stays the same with or without the stimulus in the surround (although individual cells may be facilitated or suppressed).

The results of (Liu, Hashemi-Nezhad, and Lyon 2015) are difficult to reconcile with classical results and, for me, indicate that a better measure of contrast-invariant orientation tuning is needed. This paper should definitely be read by anyone interested in this feature.

Stimulus and Methods

For the main experiment, they have two contrast conditions (low and high) that are defined for each neuron and two size conditions (CRF and CRF+ECS) that are defined for each contrast (and neuron). The smaller of the two sizes, the CRF / patch condition, is defined as the size that produces the largest response from the cell. The larger size, the CRF+ECS (extraclassical surround) condition, is defined as the size that produces the maximum suppression.

The paper almost exclusively reports the half-width at half height (HWHH). This is half the width of the (fitted) orientation tuning curve at the height that corresponds to half of the maximum response of that tuning curve.
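To make the HWHH measure concrete, here is a small sketch that computes it for a hypothetical tuning curve. The Gaussian shape, the parameter values, and the function names are assumptions made for illustration only (this is not necessarily the fit used in the paper), and the curve ignores the fact that orientation is circular. For a Gaussian with standard deviation sigma, the HWHH has the closed form sigma * sqrt(2 ln 2), which the numeric search should reproduce.

```python
import math


def gaussian_tuning(theta, preferred, sigma, r_max):
    """Hypothetical orientation tuning curve: response at orientation theta (degrees)."""
    return r_max * math.exp(-((theta - preferred) ** 2) / (2.0 * sigma ** 2))


def hwhh_numeric(preferred, sigma, r_max, step=0.001):
    """Step away from the preferred orientation until the response falls to half of r_max."""
    theta = preferred
    while gaussian_tuning(theta, preferred, sigma, r_max) > r_max / 2.0:
        theta += step
    return theta - preferred


sigma = 20.0                                            # tuning width parameter (degrees)
closed_form = sigma * math.sqrt(2.0 * math.log(2.0))    # about 23.5 degrees
numeric = hwhh_numeric(preferred=90.0, sigma=sigma, r_max=30.0)
print(closed_form, numeric)
```

Note that the HWHH does not depend on the peak response `r_max`, which is why it can be compared across contrast conditions that drive the cell to different firing rates.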
Discussion

The paper states in the discussion that most other papers on this topic did not use the optimally sized stimulus, which is why they report different results. They do point out that Finn et al. did use a similar CRF condition, but reported different results, presumably because they used patch clamping. In Supplemental Fig. 3 of Finn et al., there are some extracellularly recorded neurons that reportedly are more consistent (I haven’t checked the results).

Deep anesthesia is known to change the properties of the ECS of early visual neurons. It is unclear to me how much the results from anesthetized animals can be generalized to the normal awake state.

Liu, Yong-Jun, Maziar Hashemi-Nezhad, and David C. Lyon. 2015. “Contrast Invariance of Orientation Tuning in Cat Primary Visual Cortex Neurons Depends on Stimulus Size.” J Physiol 593 (19): 4485–98. doi:10.1113/JP271180.

Change konsole appearance during SSH
Everyone knows that feeling: when you have many consoles open at the same time connected via ssh to various servers. In this post I’m going to show a simple trick that changes the background whenever you ssh to a server and changes it back when you log out - well, at least if you are using KDE (or have konsole installed).

For example, I have a virtual Linux system that I call “Puffin”. I’ve created an alias “ssh-puffin” to log in via ssh.

Figure: Before ssh session

I have set up this alias to change the background of konsole:

Figure: During ssh session

And then, after I log out, konsole switches back to the local profile (and gives a warm and fuzzy welcome-back message).

Figure: After ssh session

Step 1: Add konsole profile(s)

Create konsole profiles and corresponding color schemes for your local system (“Local”) and remote systems (“Puffin”). You only really need to create the color schemes, but I always create a separate profile with the same name. This is done by going to Settings in a konsole window and selecting “Manage Profiles”. You can access the color schemes by clicking edit (or new) and then clicking on Appearance.

I created the Puffin background with GIMP using layers and an image from Wikimedia Commons by Richard Bartz. You can, of course, change the console appearance in other ways.

Step 2: Modify .bashrc

Add the following to your .bashrc file:

    alias resetcolors="konsoleprofile colors=Local"
    alias ssh-puffin="konsoleprofile colors=Puffin; ssh puffin; resetcolors; echo 'Welcome back'"'!'

If you have many remote servers, you may want to add your .bashrc file to github or the cloud™.

Step 3: Enjoy awesomeness

After reloading .bashrc, you can log into the server using your alias.

Acknowledgements

I first figured out how to do this from a blog post by Abdussamad.

Backpropagation with shared weights in convolutional neural networks
The success of deep convolutional neural networks would not be possible without weight sharing - the same weights being applied to different neuronal connections. However, this property also makes them more complicated. This post aims to give an intuition of how backpropagation works with weight sharing. For a more well-rounded introduction to backpropagation in convolutional neural networks, see Andrew Gibiansky’s blog post.

Backpropagation is used to calculate how the error in a neural network changes with respect to changes in a weight \(w\) in that neural network. In other words, it calculates:

\[\frac{\partial E}{\partial w}, \]

where \(E\) is the error and \(w\) is a weight.

For traditional feed-forward neural networks, each connection between two neurons has its own weight, and the backpropagation calculation is generally straightforward using the chain rule. For example, if you know how the error changes with respect to the node \(y_i\) (i.e. \(\frac{\partial E}{\partial y_i}\)), then calculating the contribution of the pre-synaptic weights of that node is simply:

\[\frac{\partial E}{\partial w}=\frac{\partial E}{\partial y_i}\frac{\partial y_i}{\partial w}. \]

This is complicated in convolutional neural networks because the weight \(w\) is used for multiple nodes (often, most or all nodes in the same layer).

Handling shared weights

In classical convolutional neural networks, shared weights are handled by summing together each instance in which the weight appears in the backpropagation derivation, instead of, for example, taking the average over the occurrences. So, if layer \(y^l\) is the layer “post-synaptic” to the weight \(w\) and we have calculated the effect of that layer on the error (\(\frac{\partial E}{\partial y^l}\)), then the gradient for the weight is:

\[\frac{\partial E}{\partial w}=\sum_i\frac{\partial E}{\partial y^l_i} \frac{\partial y^l_i}{\partial w}, \]

where \(i\) specifies the node within layer \(l\).

So why is summation the correct operation?
In essence, it is because when the paths from a weight (applied at different locations) merge, they do so with summation. For example, convolution involves summing the paths (in the dot-operation). Other operations such as max pooling and fully connected layers also involve summing the separate paths.

Simple example

Let’s take a very simple convolutional network. Let layer \(y^0\) be a 1D input layer and \([w_0, 0, 0]\) a kernel that is applied to this layer. For simplicity, let’s only have a single kernel. Then:

\[ x^1_{i}=w_0 y^0_{i} \]

An activation function is then applied to this result: \(y^1_i=h(x^1_{i})\).

For the next convolutional layer, let’s say that the kernel \([w_1,w_2,w_3]\) is applied. Then:

\[ \begin{aligned} x^2_{i}&=\sum_{a=1}^3 w_a y^1_{i+a-1} \\ &= w_1 y^1_i + w_2 y^1_{i+1} + w_3 y^1_{i+2} \\ &= w_1 h\left(w_0 y^0_{i}\right) + w_2 h\left(w_0 y^0_{i+1}\right) + w_3 h\left(w_0 y^0_{i+2}\right) \end{aligned} \]

and

\[ y^2_{i} = h(x^2_{i}). \]

So we are interested in \(\frac{\partial E}{\partial w_0}\). Let’s say that the error is only affected by the \(j\)th node of the output: \(y^2_{j}\). Then:

\[\frac{\partial E}{\partial w_0} = \frac{\partial E}{\partial y^2_{j}}\frac{\partial y^2_{j}}{\partial x^2_j}\frac{\partial x^2_{j}}{\partial w_0} \]

Assume that we have \(\frac{\partial E}{\partial y^2_{j}}\) and \(\frac{\partial y^2_{j}}{\partial x^2_j}\); then we only need to solve for \(\frac{\partial x^2_{j}}{\partial w_0}\).

\[ \begin{aligned} \frac{\partial x^2_{j}}{\partial w_0}&=\frac{\partial}{\partial w_0} \left(\sum_{a=1}^3 w_a y^1_{j+a-1}\right)\\ &= \sum_{a=1}^3 w_a \frac{\partial}{\partial w_0} \left( y^1_{j+a-1}\right)\\ &= \sum_{a=1}^3 w_a \frac{\partial}{\partial w_0} \left( h\left(w_0 y^0_{j+a-1}\right)\right)\\ &= w_1 \frac{\partial}{\partial w_0} h\left(w_0 y^0_{j}\right) + w_2 \frac{\partial}{\partial w_0} h\left(w_0 y^0_{j+1}\right) + w_3 \frac{\partial}{\partial w_0} h\left(w_0 y^0_{j+2}\right). \end{aligned} \]

Notice that each occurrence of \(w_0\) contributes its own term to the sum, which is why backpropagation sums the gradient contributions of shared weights in convolutional networks.
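The summation in the simple example can be checked numerically. Applying the chain rule to each term gives \(\frac{\partial}{\partial w_0} h\left(w_0 y^0\right) = h'\left(w_0 y^0\right) y^0\), so the derivative is a sum of three such terms, one per location where \(w_0\) is applied. The sketch below compares that sum against a finite-difference estimate. The choice of tanh for \(h\), the input values, and the function names are assumptions made for this illustration; indices are 0-based in the code, so `a` runs over 0..2 instead of 1..3.

```python
import math


def h(x):
    """Activation function; tanh is an arbitrary choice for this sketch."""
    return math.tanh(x)


def h_prime(x):
    """Derivative of tanh."""
    return 1.0 - math.tanh(x) ** 2


def x2(w0, w, y0, j):
    """x^2_j: the second-layer kernel w applied to h(w0 * y0) at positions j..j+2."""
    return sum(w[a] * h(w0 * y0[j + a]) for a in range(3))


def dx2_dw0(w0, w, y0, j):
    """Analytic derivative: one chain-rule term per location where w0 is used, summed."""
    return sum(w[a] * h_prime(w0 * y0[j + a]) * y0[j + a] for a in range(3))


y0 = [0.5, -1.0, 2.0, 0.3, -0.7]   # input layer
w = [0.2, -0.4, 0.6]               # kernel [w_1, w_2, w_3]
w0, j = 0.8, 1

analytic = dx2_dw0(w0, w, y0, j)

# Central finite difference with respect to the shared weight w0.
eps = 1e-6
numeric = (x2(w0 + eps, w, y0, j) - x2(w0 - eps, w, y0, j)) / (2.0 * eps)

print(analytic, numeric)
```

If the three terms were averaged instead of summed, the analytic value would be a third of the finite-difference estimate, so the agreement between the two numbers is a direct check that summation is the right operation.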

Passwordless ssh authentication!
On your local system, check to see if you have the following files:

    ~/.ssh/id_rsa
    ~/.ssh/

If not, type:

    ssh-keygen -t rsa

And follow the instructions. Note that ssh-agent can be used to securely save your passphrase.

After you have generated your private and public keys, you want to give your remote system the public key:

    ssh-copy-id -i ~/.ssh/ username@remote.system

After entering your password, you’re done!