<p>This is a blog about vision: visual neuroscience and computer vision, especially deep convolutional neural networks. I will also be posting Linux tips.</p>
<h1 id="he-resnet-2015">Review of He et al. 2015 <em>Deep Residual Learning for Image Recognition</em></h1>
<p>This is the classic “ResNet” or Residual Network paper <span class="citation" data-cites="he_deep_2015">(He et al. 2015)</span>, which describes a method that makes convolutional neural networks with depths of up to 152 layers trainable. The residual networks described in this paper won the ILSVRC 2015 classification task and many other competitions.</p>
<p>Ideas such as vanishing gradients are useful for understanding the paper. One of the important take-aways from this paper, though, is that preventing gradients from vanishing (or exploding) does not necessarily make it practical to find optimal solutions to very deep models.</p>
<p>The first author, Kaiming He, provides the Caffe model for the network and other useful resources at:</p>
<ul>
<li><a href="https://github.com/KaimingHe/deep-residual-networks" class="uri">https://github.com/KaimingHe/deep-residual-networks</a></li>
</ul>
<p>His CVPR 2016 talk, which reviews this paper, is available on <a href="https://www.youtube.com/watch?v=C6tLw-rPQ2o&t=1s">YouTube</a>.</p>
<p>This paper provides three main “take-away” models (plus the ensemble models). In this review, I am just going to talk about what a “residual” network is and the idea behind it, and I will not talk about the complete models. The main models are:</p>
<ul>
<li><a href="http://ethereon.github.io/netscope/#/gist/db945b393d40bfa26006">ResNet-50</a></li>
<li><a href="http://ethereon.github.io/netscope/#/gist/b21e2aae116dc1ac7b50">ResNet-101</a></li>
<li><a href="http://ethereon.github.io/netscope/#/gist/d38f3e6091952b45198b">ResNet-152</a></li>
</ul>
<p>There is a follow-up paper from the same authors, <span class="citation" data-cites="he_identity_2016">(He et al. 2016)</span>, that further optimizes the architecture and is able to train networks of up to 1,000 layers.</p>
<h1 id="what-is-a-residual-network">What is a residual network?</h1>
<p>I will first describe the implementation of residual networks, because it is incredibly easy, while the reasoning behind it seems unintuitive at first. It is also simple to implement in any of the major deep learning frameworks. A basic feedforward convolutional network will often contain consecutive convolution layers, like the subfigure A below:</p>
<figure>
<img src="/images/classic-resnet-simplified.png" title="Part of a convolutional neural network (A) without and (B) with shortcut connections." alt="Part of a convolutional neural network (A) without and (B) with shortcut connections." /><figcaption>Part of a convolutional neural network (A) without and (B) with shortcut connections.</figcaption>
</figure>
<p>Here the “in” and “out” blocks are the input and output activations (“blobs”) of this segment of the network.</p>
<p>This could also be written as:</p>
<pre><code>H(x) = relu(conv(x))</code></pre>
<p>where <span class="math inline">\(conv\)</span> performs a convolution with bias and then performs batch normalization (with scaling and bias).</p>
<p>A residual network, on the other hand, adds a shortcut connection, as shown in subfigure B above. The addition operation is an element-wise addition. This could be written as:</p>
<pre><code>F(x) = relu(conv(x) + x)</code></pre>
<p>Note that this modification does not increase the number of parameters. In fact, a trained residual network (with identity shortcuts) can be converted to an equivalent plain convolutional network and vice versa.</p>
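<p>To make this concrete, here is a minimal NumPy sketch of the two variants (my own illustration, not code from the paper), where <code>conv</code> stands in for the convolution (plus batch normalization) step:</p>

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv(x, kernel):
    # Stand-in for "convolution + batch norm": a same-size 1D convolution.
    return np.convolve(x, kernel, mode="same")

def plain_block(x, kernel):
    # H(x) = relu(conv(x)): a plain convolutional block.
    return relu(conv(x, kernel))

def residual_block(x, kernel):
    # F(x) = relu(conv(x) + x): same parameters, plus a shortcut connection.
    return relu(conv(x, kernel) + x)
```

<p>Both blocks use exactly the same kernel parameters; the residual block only adds the element-wise <code>+ x</code>.</p>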
<p>So, why does this help?</p>
<h1 id="motivation">Motivation</h1>
<p>This part is a bit crazy. When increasing the depth of traditional networks, there is an initial increase in accuracy, then it plateaus, and then, if the depth is further increased, the accuracy rapidly falls off. I know what you’re thinking: the model is being over-fitted. But this is not the case - the <strong>training</strong> error also increases with increasing depth! Of course, the validation error increases as well.</p>
<p>As a thought experiment, imagine that instead of adding the extra convolutional layers - the ones that increase the training error - we add identity functions. This network will be just as easy to train as the original network (since, well, it is practically the same network). So for classical neural networks that are too deep, you could improve the training and validation error by replacing some of the convolution layers with identity functions.</p>
<p>Instead of replacing the convolution layers with identity functions, the authors add identity functions as shortcut connections around the convolution layers. This is how I think about it: adding the shortcut connections allows the network to first learn the solution where the “extra” layers are treated as identity functions. Once it finds this optimum, it is able to use the extra layers to improve on that solution.</p>
<p>You might think that the shortcut connections allow a path where the gradient doesn’t diminish, although this may be wrong. The authors mention that batch normalization already prevents the gradients from vanishing. They suggest instead that the plain networks (without shortcuts) may have exponentially low convergence rates.</p>
<p>I’m not intending to replace the original paper, so you should read it for more details. Hopefully this gives you a good quick overview and the motivation to dig in more!</p>
<h1 id="beyond-classical-resnet">Beyond Classical ResNet</h1>
<p>In addition to the follow-up paper <span class="citation" data-cites="he_identity_2016">(He et al. 2016)</span>, there are some additional papers that go beyond this work. Chu, Yang, and Tadinada <span class="citation" data-cites="chu_visualizing_2017">(2017)</span> use visualization techniques to explore the internals of residual networks. Zagoruyko and Komodakis <span class="citation" data-cites="zagoruyko_wide_2016">(2016)</span> explore making residual networks “wider” - increasing the number of channels - and shallower. Huang et al. <span class="citation" data-cites="huang_densely_2016">(2016)</span> take the idea of residual connections to the extreme: each layer within a “dense block” is connected to all of the later layers (all layers have direct feedforward connections). There is also a wide and shallow variant that follows the strategy of <span class="citation" data-cites="zagoruyko_wide_2016">(Zagoruyko and Komodakis 2016)</span>. Huang et al. <span class="citation" data-cites="huang_deep_2016">(2016)</span> also introduced residual networks with stochastic depth.</p>
<h1 id="references" class="unnumbered">References</h1>
<div id="refs" class="references">
<div id="ref-chu_visualizing_2017">
<p>Chu, Brian, Daylen Yang, and Ravi Tadinada. 2017. “Visualizing Residual Networks.” <em>ArXiv:1701.02362 [Cs]</em>, January. <a href="http://arxiv.org/abs/1701.02362" class="uri">http://arxiv.org/abs/1701.02362</a>.</p>
</div>
<div id="ref-he_deep_2015">
<p>He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. “Deep Residual Learning for Image Recognition.” <em>ArXiv:1512.03385 [Cs]</em>, December. <a href="http://arxiv.org/abs/1512.03385" class="uri">http://arxiv.org/abs/1512.03385</a>.</p>
</div>
<div id="ref-he_identity_2016">
<p>———. 2016. “Identity Mappings in Deep Residual Networks.” <em>ArXiv:1603.05027 [Cs]</em>, March. <a href="http://arxiv.org/abs/1603.05027" class="uri">http://arxiv.org/abs/1603.05027</a>.</p>
</div>
<div id="ref-huang_densely_2016">
<p>Huang, Gao, Zhuang Liu, Kilian Q. Weinberger, and Laurens van der Maaten. 2016. “Densely Connected Convolutional Networks.” <em>ArXiv:1608.06993 [Cs]</em>, August. <a href="http://arxiv.org/abs/1608.06993" class="uri">http://arxiv.org/abs/1608.06993</a>.</p>
</div>
<div id="ref-huang_deep_2016">
<p>Huang, Gao, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. 2016. “Deep Networks with Stochastic Depth.” <em>ArXiv:1603.09382 [Cs]</em>, March. <a href="http://arxiv.org/abs/1603.09382" class="uri">http://arxiv.org/abs/1603.09382</a>.</p>
</div>
<div id="ref-zagoruyko_wide_2016">
<p>Zagoruyko, Sergey, and Nikos Komodakis. 2016. “Wide Residual Networks.” <em>ArXiv:1605.07146 [Cs]</em>, May. <a href="http://arxiv.org/abs/1605.07146" class="uri">http://arxiv.org/abs/1605.07146</a>.</p>
</div>
</div>
<h1 id="fasd-post">Fast command-line navigation, automatic bookmarking, and referencing using fasd</h1>
<p>One of my favorite recent productivity discoveries is <code>fasd</code> (<a href="https://github.com/clvv/fasd" class="uri">https://github.com/clvv/fasd</a>). It lets you easily jump to directories that you visit frequently.</p>
<p>For example, I have a directory named “caffe-help-barebones” as well as “caffe-help-git” and other “caffe-help-*” directories. Regardless of the current directory, I can jump to it by using <code>z caffe-help-barebones</code>, or by just referencing a unique part of the directory name: <code>z barebones</code>.</p>
<h1 id="installation">Installation</h1>
<p>Installation is very easy, which is nice since I quickly wanted to install it on all of my Linux systems. Additional installation methods are listed on the website. I prefer to avoid PPAs or installing these types of applications system-wide. If you prefer to use those methods, see: <a href="https://github.com/clvv/fasd/wiki/Installing-via-Package-Managers" class="uri">https://github.com/clvv/fasd/wiki/Installing-via-Package-Managers</a>.</p>
<p>I use the following to install:</p>
<pre><code>sudo apt-get install build-essential pandoc
mkdir -p ~/.local/install
mkdir -p ~/.local/bin
cd ~/.local/install
git clone git@github.com:clvv/fasd.git fasd-git
cd fasd-git
make all
PREFIX=~/.local make install</code></pre>
<p>Your bashrc should run <code>eval "$(fasd --init auto)"</code>. If you want to include it in all of your bashrc’s, even on systems that might not have <code>fasd</code> installed, you can run the following:</p>
<pre><code>echo -e '\nif command -v fasd > /dev/null; then\n  eval "$(fasd --init auto)"\nfi' >> ~/.bashrc</code></pre>
<p>This appends the following to your <code>.bashrc</code>:</p>
<pre><code>if command -v fasd > /dev/null; then
  eval "$(fasd --init auto)"
fi</code></pre>
<div id="refs" class="references">
</div>
<h1 id="ioffe-batch-2015">Review of Ioffe &amp; Szegedy 2015 <em>Batch normalization</em></h1>
<p>Normalization of training inputs has long been shown to increase the speed of learning in networks. The paper <span class="citation" data-cites="ioffe_batch_2015">(Ioffe and Szegedy 2015)</span> introduces a major improvement in deep learning, batch normalization (BN), which extends this idea by normalizing the activity <strong>within</strong> the network, across mini-batches (batches of training examples).</p>
<p>BN has been gaining a lot of traction in the academic literature, for example being used to improve segmentation <span class="citation" data-cites="hong_decoupled_2015">(Hong, Noh, and Han 2015)</span> and variational autoencoders <span class="citation" data-cites="sonderby_how_2016">(C. K. Sønderby et al. 2016)</span>.</p>
<p>The authors state that adding BN allows a version of the Inception image classification model to learn with 14 times fewer training steps, when additional modifications are made in order to take advantage of BN. One of the modifications is removing the Dropout layers, because BN acts as a regularizer and actually eliminates the need for Dropout. It also allows for the learning rate to be increased. It does all this while actually adding a small number of parameters to be learned during training. A non-batch version of BN may even have a biological homolog: homeostatic plasticity.</p>
<p>BN separates the learning of the overall distribution of the activity of the neuron from the learning of the specific synaptic weights. For each “activation” <span class="math inline">\(x^{(k)}\)</span>, the mean and spread of the distribution of the activation are given by the learned parameters <span class="math inline">\(\beta^{(k)}\)</span> and <span class="math inline">\(\gamma^{(k)}\)</span>, respectively.</p>
<h1 id="details">Details</h1>
<p>The original paper <span class="citation" data-cites="ioffe_batch_2015">(Ioffe and Szegedy 2015)</span> states that the normalization should be done per activation <span class="math inline">\(k\)</span>. In the initial part of the paper the definition of activation is left open. In their experiments, however, they do the normalization across each feature map (across batches <strong>and</strong> locations, for a specific feature).</p>
<h2 id="batch-normalization-step">Batch normalization step</h2>
<p>For now, this section just regurgitates some of the basic information from the original paper.</p>
<p>Let <span class="math inline">\(x^{(k)}_i\)</span> be a specific activation <span class="math inline">\(k\)</span> for a given input <span class="math inline">\(i\)</span>. Batch normalization then normalizes this activation over all the inputs of the batch (mini-batch), <span class="math inline">\(i \in \{ 1 \ldots m \}\)</span>.</p>
<p>BN normalizes the data to a Gaussian whose mean and variance are learned during training. This is done by first normalizing the data to the standard Gaussian (<span class="math inline">\(\mu=0\)</span> and <span class="math inline">\(\sigma=1\)</span>), and then scaling by <span class="math inline">\(\gamma^{(k)}\)</span> and adding the offset <span class="math inline">\(\beta^{(k)}\)</span>.</p>
<p>Let <span class="math inline">\(\mathcal B = \left\{ x_{1 \ldots m}\right\}\)</span> be a given batch, and let <span class="math inline">\(\mu^{(k)}_{\mathcal B}\)</span> and <span class="math inline">\(\left( \sigma^{(k)}_{\mathcal B} \right)^2\)</span> be the mean and variance of a given activation, <span class="math inline">\(k\)</span>, across the batch of training inputs.</p>
<p>The normalization / whitening step is then:</p>
<p><span class="math display">\[
\hat x_i = \frac
{x_i - \mu_{\mathcal B}}
{\sqrt{\sigma_{\mathcal B}^2 + \epsilon}}.
\]</span></p>
<p>And then there is the re-scaling and shifting step:</p>
<p><span class="math display">\[
y^{(k)}_i = \gamma^{(k)} \hat x_i + \beta^{(k)},
\]</span></p>
<p>where <span class="math inline">\(\gamma^{(k)}\)</span> and <span class="math inline">\(\beta^{(k)}\)</span>, once again, are learned parameters.</p>
<h2 id="discussion-on-caffes-implementation">Discussion on Caffe’s Implementation</h2>
<p>There is an interesting discussion on Caffe’s implementation in the pull request (PR):</p>
<p><a href="https://github.com/BVLC/caffe/pull/3229" class="uri">https://github.com/BVLC/caffe/pull/3229</a></p>
<h1 id="modifying-models-for-bn">Modifying models for BN</h1>
<p>Adding BN by itself can speed up training. However, in order to fully take advantage of BN, additional steps need to be taken. The authors’ suggestions include:</p>
<ul>
<li>Increase the learning rate (how much?).</li>
<li>Remove Dropout.</li>
<li>Reduce the L<sub>2</sub> weight regularization by a factor of 5.</li>
<li>Accelerate the learning rate decay (by a factor of 6).</li>
<li>Perform “within-shard” shuffling - although I don’t know what this is.</li>
</ul>
<h1 id="references" class="unnumbered">References</h1>
<div id="refs" class="references">
<div id="ref-hong_decoupled_2015">
<p>Hong, Seunghoon, Hyeonwoo Noh, and Bohyung Han. 2015. “Decoupled Deep Neural Network for Semi-Supervised Semantic Segmentation.” In <em>Advances in Neural Information Processing Systems 28</em>, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, 1495–1503. Curran Associates, Inc. <a href="http://papers.nips.cc/paper/5858-decoupled-deep-neural-network-for-semi-supervised-semantic-segmentation.pdf" class="uri">http://papers.nips.cc/paper/5858-decoupled-deep-neural-network-for-semi-supervised-semantic-segmentation.pdf</a>.</p>
</div>
<div id="ref-ioffe_batch_2015">
<p>Ioffe, Sergey, and Christian Szegedy. 2015. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” <em>ArXiv:1502.03167 [Cs]</em>, February. <a href="http://arxiv.org/abs/1502.03167" class="uri">http://arxiv.org/abs/1502.03167</a>.</p>
</div>
<div id="ref-sonderby_how_2016">
<p>Sønderby, Casper Kaae, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. 2016. “How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks.” <em>ArXiv:1602.02282 [Cs, Stat]</em>, February. <a href="http://arxiv.org/abs/1602.02282" class="uri">http://arxiv.org/abs/1602.02282</a>.</p>
</div>
</div>
<h1 id="ichida-2007">Review of Ichida, Schwabe, Bressloff, &amp; Angelucci (2007) “Response Facilitation From the ‘Suppressive’ Receptive Field Surround of Macaque V1 Neurons”</h1>
<h1 id="overview">Overview</h1>
<p>The extraclassical surround (ECS) generally suppresses the firing rate of visual neurons in the primary visual cortex (V1), especially when the surround stimulus has the same orientation (iso-oriented). However, it has been shown that the ECS can actually enhance the firing rate when the stimulus has a low contrast. In <span class="citation" data-cites="ichida_response_2007">(Ichida et al. 2007)</span>, the authors test a prediction from a model they have published <span class="citation" data-cites="schwabe_role_2006">(Schwabe et al. 2006)</span>: that the far ECS, and not just the immediate ECS, can enhance the response. They find that the far ECS can indeed enhance the response, but only when the immediate ECS does not contain the iso-oriented stimulus.</p>
<h1 id="methods">Methods</h1>
<p>The authors define the classical receptive field (CRF, called the minimum response field in the paper) as the region that can be driven using a small, high-contrast 0.1° grating. The size of the CRF depends on the contrast of the stimulus: it is larger for low-contrast stimuli than for high-contrast stimuli. The authors define the immediate ECS as the area beyond the high-contrast CRF where a low-contrast stimulus would increase the response.</p>
<p>Instead of using the CRF, they used what they call the high-contrast and low-contrast summation RFs. They first found the CRF (minimum response field) by using a 0.1° grating. They then used this to center a high-contrast grating patch. They then varied the size of the patch and found the size that optimally stimulated the cell. They called this the high-contrast summation RF (SRF_high) or simply the RF center. They used the same protocol with a low-contrast grating to find the low-contrast summation RF (SRF_low).</p>
<p>They called the region between SRF_high and SRF_low the near surround. The far surround’s outer diameter was set to 14°. The inner diameter varied but was no smaller than the SRF_low.</p>
<h1 id="references" class="unnumbered">References</h1>
<div id="refs" class="references">
<div id="ref-ichida_response_2007">
<p>Ichida, Jennifer M., Lars Schwabe, Paul C. Bressloff, and Alessandra Angelucci. 2007. “Response Facilitation From the ‘Suppressive’ Receptive Field Surround of Macaque V1 Neurons.” <em>Journal of Neurophysiology</em> 98 (4): 2168–81. doi:<a href="https://doi.org/10.1152/jn.00298.2007">10.1152/jn.00298.2007</a>.</p>
</div>
<div id="ref-schwabe_role_2006">
<p>Schwabe, Lars, Klaus Obermayer, Alessandra Angelucci, and Paul C. Bressloff. 2006. “The Role of Feedback in Shaping the Extra-Classical Receptive Field of Cortical Neurons: A Recurrent Network Model.” <em>J. Neurosci.</em> 26 (36): 9117–29. doi:<a href="https://doi.org/10.1523/JNEUROSCI.1253-06.2006">10.1523/JNEUROSCI.1253-06.2006</a>.</p>
</div>
</div>
<h1 id="zoccolan-2005">Review of Zoccolan et al. 2005 <em>Multiple Object Response Normalization in Monkey Inferotemporal Cortex</em></h1>
<p>Even though the neurons in inferotemporal cortex (IT) have very large receptive fields, it is tempting to believe that the neurons would be able to distinguish objects presented within their receptive fields. For example, if a neuron responds to objects A and B at different rates, perhaps the neuron should give the maximum of these two rates when both stimuli are presented within its receptive field. The study <span class="citation" data-cites="zoccolan_multiple_2005">(Zoccolan, Cox, and DiCarlo 2005)</span> shows that this is not the case and, when presented with two objects, most IT neurons’ responses are the mean of the firing rates when the objects are presented separately - at least for short presentation times and when the objects are not attended.</p>
<p>There is a lot more to this paper than what I will cover in this review / note. I hope to add more in the future, but the most important points are straightforward. They use simple artificial shapes on a plain background. The first results show that, across the population, the cells’ responses to the presentation of multiple objects cluster around the mean of their responses when the objects are presented separately. There is a slight tendency to fire at a rate somewhat higher than the average, but the lack of scatter is rather amazing. There is a line in Figures 1C and 1D for the summed responses, and very few of the cells fall on or above this line.</p>
<p>They then show that the responses to the combined-object displays are much closer to the mean of the responses to the individual-object displays than to a max model, at least on average across the cell population. There is a lot of spread in these results, leaving open the possibility that some neurons give a response that is the maximum of the responses to the two objects presented separately (or an even higher response).</p>
<div id="refs" class="references">
<div id="ref-zoccolan_multiple_2005">
<p>Zoccolan, Davide, David D. Cox, and James J. DiCarlo. 2005. “Multiple Object Response Normalization in Monkey Inferotemporal Cortex.” <em>J. Neurosci.</em> 25 (36): 8150–64. doi:<a href="https://doi.org/10.1523/JNEUROSCI.2058-05.2005">10.1523/JNEUROSCI.2058-05.2005</a>.</p>
</div>
</div>
<h1 id="liu-2015">Review of Liu, Hashemi-Nezhad, &amp; Lyon (2015) “Contrast invariance of orientation tuning in cat primary visual cortex neurons depends on stimulus size”</h1>
<h1 id="overview">Overview</h1>
<p>There are two main findings from <span class="citation" data-cites="liu_contrast_2015">(Liu, Hashemi-Nezhad, and Lyon 2015)</span> in the primary visual cortex (V1) of the anesthetized cat. First, contrast-invariant orientation tuning depends on having a stimulus that extends beyond the CRF. If the stimulus is optimized for the CRF, then the tuning width decreases with lower contrast (illustrated in Figure 3 of the paper). The orientation tuning profile is invariant when the stimulus extends into the surround, but not when it only covers the CRF.</p>
<p>The second main finding (illustrated in Figure 4 of the paper) is that contrast invariance appears with the large stimulus because the tuning width <em>decreases</em> in the high contrast stimulus when the surround stimulus is added to the CRF stimulus. The tuning width for the low contrast conditions on average stays the same with or without the stimulus in the surround (although individual cells may be facilitated or suppressed).</p>
<p>The results of <span class="citation" data-cites="liu_contrast_2015">(Liu, Hashemi-Nezhad, and Lyon 2015)</span> are difficult to reconcile with classical results and, for me, indicate that a better measure of contrast-invariant orientation tuning is needed. This paper should definitely be read by anyone interested in this feature.</p>
<h1 id="stimulus-and-methods">Stimulus and Methods</h1>
<p>For the main experiment, they have two contrast conditions (low and high) that are defined for each neuron and two size conditions (CRF and CRF+ECS) that are defined for each contrast (and neuron). The smaller of the two sizes, the CRF / patch condition, is defined as the size that produces the largest response from the cell. The larger size, the CRF+ECS (extraclassical surround) condition, is defined by the size that produces the maximum suppression.</p>
<p>The paper almost exclusively reports the half-width at half height (HWHH). This is the half-width of the (fitted) orientation tuning curve at half of the maximum response of that tuning curve.</p>
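<p>For example, if the fitted tuning curve is a Gaussian <span class="math inline">\(R(\theta) = R_{max} e^{-\theta^2 / 2\sigma^2}\)</span>, the HWHH is <span class="math inline">\(\sigma\sqrt{2\ln 2}\)</span>. A quick sketch (my own illustration, not from the paper):</p>

```python
import numpy as np

def hwhh_gaussian(sigma):
    # Half-width at half height of a Gaussian tuning curve:
    # solve exp(-theta^2 / (2 sigma^2)) = 1/2 for theta > 0.
    return sigma * np.sqrt(2.0 * np.log(2.0))

# A hypothetical fitted tuning width of sigma = 15 degrees:
width = hwhh_gaussian(15.0)  # roughly 17.7 degrees
```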
<h1 id="discussion">Discussion</h1>
<p>The paper states in the discussion that most other papers on this topic did not use the optimally sized stimulus, which is why they report different results. They do point out that <span class="citation" data-cites="finn_contrast-invariant_2007">(<span class="citeproc-not-found" data-reference-id="finn_contrast-invariant_2007">Finn et al. 2007</span>)</span> did use a similar CRF condition, but reported different results, presumably because they used patch clamping. In Supplemental Fig. 3 of Finn et al., there are some extracellularly recorded neurons that are reportedly more consistent (I haven’t checked these results).</p>
<p>Deep anesthesia is known to change the properties of the ECS of early visual neurons. It is unclear to me how much the results from anesthetized animals can be generalized to the normal awake state.</p>
<div id="refs" class="references">
<div id="ref-liu_contrast_2015">
<p>Liu, Yong-Jun, Maziar Hashemi-Nezhad, and David C. Lyon. 2015. “Contrast Invariance of Orientation Tuning in Cat Primary Visual Cortex Neurons Depends on Stimulus Size.” <em>J Physiol</em> 593 (19): 4485–98. doi:<a href="https://doi.org/10.1113/JP271180">10.1113/JP271180</a>.</p>
</div>
</div>
<h1 id="konsole-ssh">Change konsole appearance during SSH</h1>
<p><em>Everyone</em> knows that feeling: when you have many consoles open at the same time connected via ssh to various servers. In this post I’m going to show a simple trick that changes the background whenever you ssh to a server and changes it back when you log out - well, at least if you are using KDE (or have konsole installed).</p>
<p>For example, I have a virtual linux system that I call “Puffin”. I’ve created an alias “ssh-puffin” to login via ssh.</p>
<figure>
<img src="http://neural.vision/images/konsole-ssh-awesomeness-1.png" alt="Before ssh session" /><figcaption>Before ssh session</figcaption>
</figure>
<p>I have setup this alias to change the background of konsole:</p>
<figure>
<img src="http://neural.vision/images/konsole-ssh-awesomeness-2.png" alt="During ssh session" /><figcaption>During ssh session</figcaption>
</figure>
<p>And then, after I log out, the konsole switches back to the local profile (and gives a warm and fuzzy welcome-back message).</p>
<figure>
<img src="http://neural.vision/images/konsole-ssh-awesomeness-3.png" alt="After ssh session" /><figcaption>After ssh session</figcaption>
</figure>
<h2 id="step-1-add-konsole-profiles">Step 1: Add konsole profile(s)</h2>
<p>Create konsole profiles and corresponding color schemes for your local system (“Local”) and remote systems (“Puffin”). You really only need to create the color schemes, but I always create a separate profile with the same name. This is done by going to the Settings menu of a konsole window and selecting “Manage Profiles”. You can access the color schemes by clicking Edit (or New) and then clicking on Appearance.</p>
<p>I created the Puffin background with GIMP using layers and an <a href="https://commons.wikimedia.org/wiki/File:Papageitaucher_Fratercula_arctica.jpg">image from Wikimedia Commons</a> by <a href="https://commons.wikimedia.org/wiki/User:Richard_Bartz">Richard Bartz</a>.</p>
<p>You can, of course, change the console appearance in other ways.</p>
<h2 id="step-2-modify-.bashrc">Step 2: Modify .bashrc</h2>
<p>Add the following to your .bashrc file:</p>
<pre class="ssh"><code>alias resetcolors="konsoleprofile colors=Local"
alias ssh-puffin="konsoleprofile colors=Puffin; ssh puffin; resetcolors; echo 'Welcome back'"'!'</code></pre>
<p>If you have many remote servers, you may want to add your .bashrc file to github or the cloud™.</p>
<h2 id="step-3-enjoy-awesomeness">Step 3: Enjoy awesomeness</h2>
<p>After reloading <code>.bashrc</code>, you can then log into the server using your alias.</p>
<h2 id="acknowledgements">Acknowledgements</h2>
<p>I first figured out how to do this from a <a href="https://abdussamad.com/archives/503-Changing-Konsole-colors-in-KDE.html">blog post by Abdussamad</a>.</p>
<div id="refs" class="references">
</div>
<h1 id="backprop-shared-weights">Backpropagation with shared weights in convolutional neural networks</h1>
<p>The success of deep convolutional neural networks would not be possible without weight sharing - the same weights being applied to different neuronal connections. However, this property also makes them more complicated. This post aims to give an intuition of how backpropagation works with weight sharing. For a more well-rounded introduction to backpropagation in convolutional neural networks, see Andrew Gibiansky’s <a href="http://andrew.gibiansky.com/blog/machine-learning/convolutional-neural-networks/">blog post</a>.</p>
<p>Backpropagation is used to calculate how the error in a neural network changes with respect to changes in a weight <span class="math inline">\(w\)</span> in that neural network. In other words, it calculates:</p>
<p><span class="math display">\[\frac{\partial E}{\partial w},
\]</span></p>
<p>where <span class="math inline">\(E\)</span> is the error and <span class="math inline">\(w\)</span> is a weight.</p>
<p>For traditional feed-forward neural networks, each connection between two neurons has its own weight, and the calculation of the backpropagation is generally straightforward using the chain rule. For example, if you know how the error changes with respect to the node <span class="math inline">\(y_i\)</span> (i.e. <span class="math inline">\(\frac{\partial E}{\partial y_i}\)</span>), then calculating the contribution of the pre-synaptic weights of that node is simply:</p>
<p><span class="math display">\[\frac{\partial E}{\partial w}=\frac{\partial E}{\partial y_i}\frac{\partial y_i}{\partial w}.
\]</span></p>
<p>This is complicated in convolutional neural networks because the weight <span class="math inline">\(w\)</span> is used for multiple nodes (often, most or all nodes in the same layer).</p>
<h1 id="handling-shared-weights">Handling shared weights</h1>
<p>In classical convolutional neural networks, shared weights are handled by summing together each instance in which the weight appears in the backpropagation derivation, instead of, for example, taking the average over the occurrences. So, if layer <span class="math inline">\(y^l\)</span> is the layer “post-synaptic” to the weight <span class="math inline">\(w\)</span> and we have calculated the effect of that layer on the error (<span class="math inline">\(\frac{\partial E}{\partial y^l}\)</span>), then the weight gradient is:</p>
<p><span class="math display">\[\frac{\partial E}{\partial w}=\sum_i\frac{\partial E}{\partial y^l_i} \frac{\partial y^l_i}{\partial w},
\]</span></p>
<p>where <span class="math inline">\(i\)</span> specifies the node within layer <span class="math inline">\(l\)</span>.</p>
<p>So why is summation the correct operation? In essence, it is because when the paths from a weight (applied at different locations) merge, they merge by summation. For example, convolution sums over the incoming paths (in its dot product), and fully connected layers likewise sum their separate input paths (max pooling, in contrast, selects a single path).</p>
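<p>The summation rule can be checked numerically. The sketch below (my own, not from any framework) computes <span class="math inline">\(\frac{\partial E}{\partial w_a}\)</span> for a tiny 1D convolution by summing the weight’s contribution at every position, and compares it to a finite-difference estimate:</p>

```python
import numpy as np

def forward(x, w):
    # "Valid" 1D convolution (as correlation): y_i = sum_a w_a * x_{i+a}
    n = len(x) - len(w) + 1
    return np.array([np.dot(w, x[i:i + len(w)]) for i in range(n)])

def error(y):
    # A simple error function: E = sum_i y_i^2
    return np.sum(y ** 2)

x = np.array([0.5, -1.0, 2.0, 1.5, -0.5])
w = np.array([0.3, -0.2, 0.7])

y = forward(x, w)
dE_dy = 2.0 * y  # dE/dy_i for E = sum(y_i^2)

# Shared-weight gradient: sum over every node i where w_a was applied.
grad = np.array([np.sum(dE_dy * x[a:a + len(y)]) for a in range(len(w))])

# Finite-difference check for w_0.
h = 1e-6
w_plus = w.copy()
w_plus[0] += h
numeric = (error(forward(x, w_plus)) - error(forward(x, w))) / h
```

<p>The analytic <code>grad[0]</code> and the numerical estimate agree, which would not be the case if the per-position contributions were averaged instead of summed.</p>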
<h1 id="simple-example">Simple example</h1>
<p>Let’s take a very simple convolutional network.</p>
<p>Let layer <span class="math inline">\(y^0\)</span> be a 1D input layer and <span class="math inline">\([w_0, 0, 0]\)</span> a kernel that is applied to this layer. For simplicity, let’s use only a single kernel. Then:</p>
<p><span class="math display">\[
x^1_{i}=w_0 y^0_{i}
\]</span></p>
<p>An activation function is then applied to this result: <span class="math inline">\(y^1_i=h(x^1_{i})\)</span>.</p>
<p>For the next convolutional layer, let’s say that the kernel <span class="math inline">\([w_1,w_2,w_3]\)</span> is applied. Then:</p>
<p><span class="math display">\[
\begin{aligned}
x^2_{i}&=\sum_{a=1}^3 w_a y^1_{i+a-1} \\
&= w_1 y^1_i + w_2 y^1_{i+1} + w_3 y^1_{i+2} \\
&= w_1 h\left(w_0 y^0_{i}\right) + w_2 h\left(w_0 y^0_{i+1}\right) + w_3 h\left(w_0 y^0_{i+2}\right). \\
\end{aligned}
\]</span> and <span class="math display">\[
y^2_{i} = h(x^2_{i}).
\]</span></p>
<p>So we are interested in <span class="math inline">\(\frac{\partial E}{\partial w_0}\)</span>. Let’s say that the error is only affected by the <span class="math inline">\(j\)</span>th node of the output, <span class="math inline">\(y^2_{j}\)</span>. Then:</p>
<p><span class="math display">\[\frac{\partial E}{\partial w_0} = \frac{\partial E}{\partial y^2_{j}}\frac{\partial y^2_{j}}{\partial x^2_j}\frac{\partial x^2_{j}}{\partial w_0}
\]</span></p>
<p>Assuming that we already have <span class="math inline">\(\frac{\partial E}{\partial y^2_{j}}\)</span> and <span class="math inline">\(\frac{\partial y^2_{j}}{\partial x^2_j}\)</span>, we only need to solve for <span class="math inline">\(\frac{\partial x^2_{j}}{\partial w_0}\)</span>.</p>
<p><span class="math display">\[
\begin{aligned}
\frac{\partial x^2_{j}}{\partial w_0}&=\frac{\partial}{\partial w_0} \left(\sum_{a=1}^3 w_a y^1_{j+a-1}\right)\\
&= \sum_{a=1}^3 w_a \frac{\partial}{\partial w_0} \left( y^1_{j+a-1}\right)\\
&= \sum_{a=1}^3 w_a \frac{\partial}{\partial w_0} \left( h\left(w_0 y^0_{j+a-1}\right)\right)\\
&= w_1 \frac{\partial}{\partial w_0} h\left(w_0 y^0_{j}\right) +
w_2 \frac{\partial}{\partial w_0} h\left(w_0 y^0_{j+1}\right) +
w_3 \frac{\partial}{\partial w_0} h\left(w_0 y^0_{j+2}\right). \\
\end{aligned}
\]</span></p>
<p>Notice that each occurrence of <span class="math inline">\(w_0\)</span> contributes a separately summed term, which is why backpropagation sums over shared weights in convolutional networks.</p>
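The derivation can be checked numerically. The sketch below implements the two-layer example, assuming <code>tanh</code> as the activation <code>h</code> (the text leaves <code>h</code> unspecified) and made-up values for the input and weights; the analytic gradient of <span class="math inline">\(x^2_j\)</span> with respect to <span class="math inline">\(w_0\)</span>, summed over the three occurrences of <span class="math inline">\(w_0\)</span>, agrees with a finite-difference estimate.

```python
import numpy as np

# Two-layer example from the text, with h = tanh as the assumed activation.
# Layer 1: x1_i = w0 * y0_i,  y1 = h(x1).
# Layer 2: x2_j = w1*y1_j + w2*y1_{j+1} + w3*y1_{j+2}.
y0 = np.array([0.2, -0.5, 1.0, 0.3])  # illustrative input values
w0 = 0.7
w1, w2, w3 = 0.4, -0.3, 0.9

h = np.tanh
h_prime = lambda z: 1.0 - np.tanh(z) ** 2  # derivative of tanh

def x2_at(j, w0):
    y1 = h(w0 * y0)
    return w1 * y1[j] + w2 * y1[j + 1] + w3 * y1[j + 2]

j = 0
# Analytic gradient: sum over the three occurrences of w0, as derived above.
ws = [w1, w2, w3]
grad = sum(ws[a] * h_prime(w0 * y0[j + a]) * y0[j + a] for a in range(3))

# Finite-difference check of d(x2_j)/d(w0):
eps = 1e-6
grad_fd = (x2_at(j, w0 + eps) - x2_at(j, w0 - eps)) / (2 * eps)
```

Dropping any one of the three summed terms makes <code>grad</code> disagree with <code>grad_fd</code>, mirroring the point of the derivation.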
<h1 id="passwordless-ssh-authentication">Passwordless ssh authentication</h1>
<p>Posted 2015-12-21 by jonathan at <a href="http://neural.vision/blog/linux/passwordless-ssh/" class="uri">http://neural.vision/blog/linux/passwordless-ssh/</a>.</p>
<p>On your local system, check whether you have the following files:</p>
<ul>
<li>~/.ssh/id_rsa</li>
<li>~/.ssh/id_rsa.pub</li>
</ul>
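The check can be scripted; the sketch below is one way to do it, assuming the default RSA file names (a key generated as another type, e.g. <code>id_ed25519</code>, would be named differently).

```shell
check_keys() {
    # Report whether an RSA key pair exists in the given .ssh directory
    # (defaults to ~/.ssh).
    dir="${1:-$HOME/.ssh}"
    if [ -f "$dir/id_rsa" ] && [ -f "$dir/id_rsa.pub" ]; then
        echo "key pair found"
    else
        echo "no key pair - run ssh-keygen"
    fi
}

check_keys
```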
<p>If not, type:</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash"><span class="fu">ssh-keygen</span> -t rsa</code></pre></div>
<p>And follow the instructions. Note that <code>ssh-agent</code> can be used to securely save your passphrase.</p>
<p>After you have generated your private and public keys, give the remote system your public key:</p>
<div class="sourceCode"><pre class="sourceCode bash"><code class="sourceCode bash"><span class="ex">ssh-copy-id</span> -i ~/.ssh/id_rsa.pub username@remote.system</code></pre></div>
<p>After entering your password, you’re done!</p>
<p>Reference: <a href="http://www.debian-administration.org/articles/152" class="uri">http://www.debian-administration.org/articles/152</a></p>