Blog Archive


CLIP, LLaVA, and the Brain - What the brain can teach us about visual processing
How do recent artificial neural networks, like the CLIP (Radford et al. 2021) and LLaVA (Liu et al. 2023) transformer networks, compare to the brain? Is the attention in these networks similar to attention in the brain? In this article I look at these transformer architectures with an eye on their similarities to and differences from the mammalian brain and visual system. I come to the conclusion that the processing that vision transformers, CLIP, and LLaVA perform is analogous to a type of computation called pre-attentive visual processing. This processing is done in the initial feedforward visual responses to a stimulus, before any recurrence. Although a lot can be accomplished in a feedforward way, studies have shown that feedforward pre-attentive processing in the brain has difficulty with: (1) distinguishing the identity or characteristics of similar types of objects, especially when the objects are close together or cluttered, or when they are unnatural or artificial (VanRullen 2007); (2) more complex tasks such as counting, maze solving, or curve tracing; and (3) perceiving objects that are more difficult to see, such as those whose boundaries are difficult to perceive. In contrast to feedforward-only processing, one of the things that really stands out about the brain is the richness of the interaction between areas, which I will discuss in more detail in the next section.

Bidirectional Activity in the Brain

In most current deep learning architectures, activity is propagated in a single direction. For example, an image might be given as input to a network and then propagated from layer to layer until it reaches a classification as the output.

Figure 1: A simplified diagram showing some of the feedforward and feedback connections in the macaque brain. The areas that are earlier (or lower-level) are more white, while the areas that are later (or higher-level) are more blue.

The brain is much more interesting than these feedforward models.
In the visual system, a stimulus will propagate from lower to higher level areas in a feedforward-like fashion, but then the higher level areas will also influence the lower level areas, as shown in Figure 1. Some of this feedback is the conscious top-down attention that allows us to allocate more resources to objects and features of interest and allows us to disambiguate stimuli that are either complex or ambiguous. Another part of this feedback is automatic and allows higher level areas to infuse the lower level areas with information that could not be known in just the feedforward manner. The conscious top-down attention is thought to support consciousness of visual stimuli. Without conscious access to lower level areas that encode borders and edges, we wouldn’t have as spatially precise a perception of borders. Tasks such as mentally tracing a curve or solving a maze would become impossible. One example of the automatic unconscious feedback is border-ownership, which is seen in about half of the orientation-selective neurons in visual area V2 (Zhou, Friedman, and von der Heydt 2000; Williford and von der Heydt 2013). These neurons will encode local information in about 40 ms and, as early as 10 ms after this initial response, will start to incorporate global context to resolve occlusions - holding the information needed to know which objects are creating borders by occluding their backgrounds. Another example of this unconscious feedback was shown in Poort et al. (2012) using images like that in Figure 2. In the macaque early visual cortex V1, neurons will tend to initially (within 50-75 ms of stimulus presentation) encode only the local features within their receptive fields (e.g. green square). However, after around 75 ms, they will receive feedback from the higher level areas and they will tend to have a higher response when that texture belongs to a figure, such as the texture-defined figure in Figure 2.
This happens even when attention is drawn away from the figure; however, if the monkey is paying attention to the figure, the neurons will tend to respond even more.

Figure 2: Image from (Poort et al. 2012). Shapes that are defined only by texture, like the above, can be difficult to see in a pure “feed-forward” manner. The biological visual system is able to recognize shapes like these through the interaction of lower and higher level areas, including top-down attention and subconscious processes.

One way to look at this bidirectional interaction is that at any given time, each neuron greedily uses all available predictive signals. Even higher level areas can be informative.

Transformers

With all the talk about attention with the introduction of transformers (Vaswani et al. 2017) and with the ability to generate sentences one word at a time, you might be led to believe that transformers have recurrence. However, there is no “state” that is kept between the steps of the transformer, except for the previous output. So at best the recurrence is very limited, and there is none of the bidirectionality that is ubiquitous in the brain. Transformers do allow for multi-headed attention, which could be interpreted as being able to attend to multiple things simultaneously. In the original paper, the transformer used 8 attention heads. Image transformers can be seen as analogous to pre-attentive feedforward processing with some modifications, such as the multiple attention heads.

CLIP

Figure 3: Image from Radford et al. (2021) depicting how CLIP is trained. \(I_1\) and \(T_1\) are the encodings of image 1 and the corresponding caption. A contrastive learning loss is used to make the \(I_i\) and \(T_j\) more similar when \(i=j\) and more dissimilar when \(i \neq j\). Weights are trained from scratch.

CLIP was introduced by OpenAI in the Radford et al. (2021) paper “Learning Transferable Visual Models from Natural Language Supervision”.
The idea behind CLIP is pretty simple and is shown in Figure 3. It takes a bunch of image and caption pairs from the Internet, and feeds the image to an image encoder and the text to a text encoder. It then uses a loss that brings the encoding of the image and the encoding of the text closer together when they are in the same pair; otherwise the loss increases the distance between the encodings. This is what CLIP gives you: the ability to compare the similarity between text and images. One way this can be used is for zero-shot classification, as shown in Figure 4. CLIP does not, by itself, generate text descriptions from images. The image encoder and text encoder are independent, meaning that there is no way for task-driven modulation to influence the image encoding. This means that the image encoder has to encode everything that could be potentially relevant to the task. Typically the resolution of the input image is pretty small, which helps prevent the computation and memory requirements from exploding.

Figure 4: Image from Radford et al. (2021) depicting how CLIP can be used for zero-shot classification. Text encodings are generated for each class \(T_1\ldots T_N\). The image is then encoded and the similarity is measured with the generated text encodings. The most similar text encoding is the chosen class.

LLaVA

Figure 5: LLaVA architecture from Liu et al. (2023). \(\mathrm X_v\): image, \(\mathrm X_c\): caption, \(\mathrm X_q\): question derived from \(\mathrm X_c\) using GPT-4.

Large Language and Vision Assistant (LLaVA) (Liu et al. 2023) is a large language and vision architecture that extends and builds onto CLIP to add the ability to describe and answer questions about images. This type of architecture is interesting to me because it can attempt tasks that are similar to those used in Neuroscience and Psychology. LLaVA takes the vision transformer model ViT-L/14 that is trained by CLIP for image encoding (Figure 5).
To convert the encodings into tokens, the first paper uses a single linear projection matrix \(\mathrm W\). The tokens calculated from the images \(\mathrm H_v\) and the tokens from the text instructions \(\mathrm H_q\) are provided as input. LLaVA can then generate the language response \(\mathrm X_a\) one token at a time, each time appending the response so far to the input of the next iteration. I won’t go into the details of how LLaVA is trained, but it is interesting how they use ChatGPT to expand the caption (\(\mathrm X_c\) in Figure 5) to form instructions (\(\mathrm H_q\)) and responses (used to train \(\mathrm X_a\)) about an image, and how they use bounding box information. In version 1.5 of LLaVA (Liu et al. 2024), some of the improvements they made include: (1) the linear projection matrix \(\mathrm W\) is replaced with a multilayer perceptron; and (2) the image resolution is increased by using an image encoder that takes images of size 336x336 pixels and splits the images into grids that are encoded separately. Task-driven attention in the brain is able to dynamically allocate resources to the object, location, or features of interest, which can allow processing of information that could otherwise be overwhelmed by clutter or other objects. In LLaVA, the image encoder is independent of the text instructions, so to be successful it needs to make sure any potentially useful information is stored in the image tokens (\(\mathrm H_v\)).

Conclusion

Since LLaVA and CLIP lack bidirectional processing, the processing that they do is limited. This is especially true for image processing, since image processing is done independently of the text instructions. Most convolutional neural networks also share these limitations. This leads me to my conjecture:

Conjecture: Most convolutional, vision transformer, and multimodal transformer networks are restricted to something like pre-attentive feedforward visual processing.
This is not necessarily a criticism as much as an insight that can be informative. Feedforward processing can do a lot and is fast. However, it is not as dynamic in how resources can be allocated, which can lead to informational bottlenecks in cluttered scenes, and it is unable to encode enough information for complex tasks without an explosion in the size of the encodings. There are some networks that are not limited to pre-attentive feedforward processing, but currently most of these architectures lag behind transformers. These include long short-term memory models (LSTMs) and, more recently, the Mamba architecture, which has several benefits over transformers (Gu and Dao 2024). Extended LSTMs (Beck et al. 2024; Alkin et al. 2024) have been proposed that help make up some of the ground between transformers and LSTMs.

References

Alkin, Benedikt, Maximilian Beck, Korbinian Pöppel, Sepp Hochreiter, and Johannes Brandstetter. 2024. “Vision-LSTM: xLSTM as Generic Vision Backbone.” June 6, 2024. http://arxiv.org/abs/2406.04303. Beck, Maximilian, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. 2024. “xLSTM: Extended Long Short-Term Memory.” May 7, 2024. http://arxiv.org/abs/2405.04517. Gu, Albert, and Tri Dao. 2024. “Mamba: Linear-Time Sequence Modeling with Selective State Spaces.” May 31, 2024. http://arxiv.org/abs/2312.00752. Liu, Haotian, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2024. “Improved Baselines with Visual Instruction Tuning.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 26296–306. https://openaccess.thecvf.com/content/CVPR2024/html/Liu_Improved_Baselines_with_Visual_Instruction_Tuning_CVPR_2024_paper.html. Liu, Haotian, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. “Visual Instruction Tuning.” December 11, 2023. https://doi.org/10.48550/arXiv.2304.08485.
Poort, Jasper, Florian Raudies, Aurel Wannig, Victor A F Lamme, Heiko Neumann, and Pieter R Roelfsema. 2012. “The Role of Attention in Figure-Ground Segregation in Areas V1 and V4 of the Visual Cortex.” Neuron 75 (1): 143–56. https://doi.org/10.1016/j.neuron.2012.04.032. Radford, Alec, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, and Jack Clark. 2021. “Learning Transferable Visual Models from Natural Language Supervision.” In International Conference on Machine Learning, 8748–63. PMLR. http://proceedings.mlr.press/v139/radford21a. VanRullen, Rufin. 2007. “The Power of the Feed-Forward Sweep.” Advances in Cognitive Psychology 3 (1-2): 167. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2864977/. Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” Advances in Neural Information Processing Systems 30. https://proceedings.neurips.cc/paper/7181-attention-is-all. Williford, Jonathan R., and Rudiger von der Heydt. 2013. “Border-Ownership Coding.” Scholarpedia 8 (10): 30040. http://scholarpedia.org/article/Border-ownership_coding. Zhou, H., H. S. Friedman, and R. von der Heydt. 2000. “Coding of Border Ownership in Monkey Visual Cortex.” The Journal of Neuroscience 20 (17): 6594–6611.
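
As a rough illustration of the zero-shot classification procedure described in the CLIP section of the post above, here is a minimal sketch in plain Python. It does not load or call CLIP: the three-dimensional embeddings, the prompt strings, and the function names (`normalize`, `cosine`, `zero_shot_classify`) are all made up for illustration, standing in for the outputs of the real image and text encoders. It only shows the final step, picking the class whose text encoding is most similar to the image encoding.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def zero_shot_classify(image_emb, text_embs):
    """Return the label whose text embedding is most similar to the image."""
    sims = {label: cosine(image_emb, emb) for label, emb in text_embs.items()}
    best = max(sims, key=sims.get)
    return best, sims


# Hypothetical text encodings T_1..T_N, one per class prompt (toy values).
text_embs = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}

# Hypothetical image encoding (toy values).
image_emb = [0.8, 0.3, 0.1]

label, sims = zero_shot_classify(image_emb, text_embs)
print(label)
```

Note that the image embedding is computed once, without any knowledge of the class prompts, which is the independence between the two encoders that the post points out.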

Review of Ioffe & Szegedy 2015 *Batch normalization*
Normalization of training inputs has long been shown to increase the speed of learning in networks. The paper (Ioffe and Szegedy 2015) introduces a major improvement in deep learning, batch normalization (BN), which extends this idea by normalizing the activity within the network, across mini-batches (batches of training examples). BN has been gaining a lot of traction in the academic literature, for example being used to improve segmentation (Hong, Noh, and Han 2015) and variational autoencoders (Sønderby et al. 2016). The authors state that adding BN allows a version of the Inception image classification model to learn with 14 times fewer training steps, when additional modifications are made in order to take advantage of BN. One of the modifications is removing the Dropout layers, because BN acts as a regularizer and actually eliminates the need for Dropout. It also allows for the learning rate to be increased. It does all this while adding only a small number of parameters to be learned during training. A non-batch version of BN may even have a biological homolog: homeostatic plasticity. BN separates the learning of the overall distribution of the activity of the neuron from the learning of the specific synaptic weights. For each “activation” \(x^{(k)}\), the mean and spread of the distribution of the activation are given by the learned parameters \(\beta^{(k)}\) and \(\gamma^{(k)}\) respectively.

Details

The original paper (Ioffe and Szegedy 2015) states that the normalization should be done per activation \(k\). In the initial part of the paper the definition of activation is left open. In their experiments, however, they do the normalization across each feature map (across batches and locations, for a specific feature).

Batch normalization step

For now, this section just regurgitates some of the basic information from the original paper. Let \(x^{(k)}_i\) be a specific activation \(k\) for a given input \(i\).
Batch normalization then normalizes this activation over all the inputs \(i \in \{ 1 \ldots m \}\) of the batch (mini-batch). BN normalizes the data to a distribution whose mean and spread are learned during training. This is done by first normalizing the data to zero mean and unit variance (\(\mu=0\) and \(\sigma=1\)), and then scaling by \(\gamma^{(k)}\) and adding the offset \(\beta^{(k)}\). Let \(\mathcal B = \left\{ x_{1 \ldots m}\right\}\) be a given batch, and let \(\mu^{(k)}_{\mathcal B}\) and \(\left(\sigma^{(k)}_{\mathcal B}\right)^2\) be the mean and variance of a given activation \(k\) across the batch of training inputs. The normalization / whitening step is then:

\[ \hat x^{(k)}_i = \frac {x^{(k)}_i - \mu^{(k)}_{\mathcal B}} {\sqrt{\left(\sigma^{(k)}_{\mathcal B}\right)^2 + \epsilon}}, \]

where \(\epsilon\) is a small constant added for numerical stability. And then there is the re-scaling and shifting step:

\[ y^{(k)}_i = \gamma^{(k)} \hat x^{(k)}_i + \beta^{(k)}, \]

where \(\gamma^{(k)}\) and \(\beta^{(k)}\), once again, are learned parameters.

Discussion on Caffe’s Implementation

There is an interesting discussion on Caffe’s implementation in the pull request (PR): https://github.com/BVLC/caffe/pull/3229

Modifying models for BN

Adding BN by itself can speed up training. However, in order to take full advantage of BN, additional steps need to be taken. The authors’ suggestions include: Increase the learning rate (how much?). Remove Dropout. Reduce the L2 weight regularization by a factor of 5. Increase the learning rate decay by a factor of 6. Perform “within-shard” shuffling - although I don’t know what this is.

References

Hong, Seunghoon, Hyeonwoo Noh, and Bohyung Han. 2015. “Decoupled Deep Neural Network for Semi-Supervised Semantic Segmentation.” In Advances in Neural Information Processing Systems 28, edited by C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, 1495–1503. Curran Associates, Inc. http://papers.nips.cc/paper/5858-decoupled-deep-neural-network-for-semi-supervised-semantic-segmentation.pdf. Ioffe, Sergey, and Christian Szegedy.
2015. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” arXiv:1502.03167 [Cs], February. http://arxiv.org/abs/1502.03167. Sønderby, Casper Kaae, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. 2016. “How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks.” arXiv:1602.02282 [Cs, Stat], February. http://arxiv.org/abs/1602.02282.
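
To make the batch normalization step from the review above concrete, here is a minimal sketch in plain Python for a single activation \(k\) over one mini-batch. It is not the paper's or Caffe's implementation: the function name `batch_norm`, the default \(\epsilon\), and the example values of \(\gamma\) and \(\beta\) are assumptions chosen for illustration, and it only covers the training-time forward pass (no gradients, no inference-time statistics).

```python
import math


def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one activation across a mini-batch, then scale and shift.

    xs    : values x_i of a single activation k, one per input in the batch
    gamma : scale parameter (learned in a real network; fixed here)
    beta  : shift parameter (learned in a real network; fixed here)
    eps   : small constant for numerical stability (value is an assumption)
    """
    m = len(xs)
    mu = sum(xs) / m                          # batch mean
    var = sum((x - mu) ** 2 for x in xs) / m  # batch variance (divides by m)
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]  # normalization
    return [gamma * xh + beta for xh in x_hat]             # scale and shift


batch = [1.0, 2.0, 3.0, 4.0]  # made-up activations for four inputs
y = batch_norm(batch, gamma=2.0, beta=0.5)
print(y)
```

After the transform the batch has mean close to `beta` and standard deviation close to `gamma`, which is the sense in which those two parameters set the mean and spread of the activation.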

Review of Ichida, Schwabe, Bressloff, & Angelucci (2007) 'Response Facilitation From the “Suppressive” Receptive Field Surround of Macaque V1 Neurons'
Overview

The extraclassical surround (ECS) generally suppresses the firing rate of visual neurons in the primary visual cortex (V1), especially when the surround stimulus has the same orientation (iso-oriented). However, it has been shown that the ECS can actually enhance the firing rate when the stimulus has a low contrast. In Ichida et al. (2007), the authors test a prediction from a model they have published (Schwabe et al. 2006): that the far ECS, and not just the immediate ECS, can enhance the response. They find that the far ECS can indeed enhance the response, but only when the immediate ECS does not contain the iso-oriented stimulus.

Methods

The authors define the classical receptive field (CRF, called the minimum response field in the paper) as the region that can be driven using a small high contrast 0.1° grating. The size of the CRF depends on the contrast of the stimulus. It is larger for low contrast stimuli than for high contrast stimuli. The authors define the immediate ECS as the area beyond the high contrast CRF where a low contrast stimulus would increase the response. Instead of using the CRF, they used what they call the high-contrast and low-contrast summation RF. They first found the CRF (minimum response field) by using a 0.1° grating. They then used this to center a high-contrast grating patch. They then varied the size of the patch and found the size that optimally stimulated the cell. They called this the high contrast summation RF (SRF_high) or simply the RF center. They used the same protocol with a low contrast grating to find the low-contrast summation RF (SRF_low). They called the region between SRF_high and SRF_low the near surround. The far surround’s outer diameter was set to 14°. The inner diameter varied but was no smaller than the SRF_low.

References

Ichida, Jennifer M., Lars Schwabe, Paul C. Bressloff, and Alessandra Angelucci. 2007.
“Response Facilitation From the ‘Suppressive’ Receptive Field Surround of Macaque V1 Neurons.” Journal of Neurophysiology 98 (4): 2168–81. https://doi.org/10.1152/jn.00298.2007. Schwabe, Lars, Klaus Obermayer, Alessandra Angelucci, and Paul C. Bressloff. 2006. “The Role of Feedback in Shaping the Extra-Classical Receptive Field of Cortical Neurons: A Recurrent Network Model.” J. Neurosci. 26 (36): 9117–29. https://doi.org/10.1523/JNEUROSCI.1253-06.2006.

Review of Zoccolan et al. 2005 Multiple
Even though the neurons in inferotemporal cortex (IT) have very large receptive fields, it is tempting to believe that the neurons would be able to distinguish objects presented within their receptive fields. For example, if a neuron responds to objects A and B at different rates, perhaps the neuron should give the maximum of these two rates when both stimuli are presented within its receptive field. The study (Zoccolan, Cox, and DiCarlo 2005) shows that this is not the case and, when presented with two objects, most IT neurons’ responses are the mean of the firing rates when the objects are presented separately - at least for short presentation times and when the objects are not attended. There is a lot more to this paper than what I will cover in this review / note. I hope to add more in the future, but the most important points are straightforward. They use simple artificial shapes on a plain background. The first results show that in the population, the cells’ responses to the presentation of multiple objects cluster around the mean of their responses when the objects are presented separately. There is a slight tendency to fire at a rate slightly higher than the average, but the lack of scatter is rather amazing. There is a line in Figure 1C and 1D for the sum of the responses, and very few of the cells fall on or above this line. They then show that the responses to the combined object displays are much more like the mean of the responses to individual object displays than a max model, at least in the mean cell population. There is a lot of spread in these results, leaving open the possibility that some neurons give a response that is the maximum of the responses to the two objects separately (or an even higher response).

Zoccolan, Davide, David D. Cox, and James J. DiCarlo. 2005. “Multiple Object Response Normalization in Monkey Inferotemporal Cortex.” J. Neurosci. 25 (36): 8150–64. https://doi.org/10.1523/JNEUROSCI.2058-05.2005.

Review of Liu, Hashemi-Nezhad, & Lyon (2015) 'Contrast invariance of orientation tuning in cat primary visual cortex neurons depends on stimulus size'
Overview

There are two main findings from (Liu, Hashemi-Nezhad, and Lyon 2015) in the primary visual cortex (V1) of anesthetized cats. First, contrast-invariant orientation tuning depends on having a stimulus that extends beyond the CRF. If the stimulus is optimized for the CRF, then the tuning width decreases with lower contrast (illustrated in Figure 3 of the paper). The orientation tuning profile is invariant when the stimulus extends to the surround, but not when it only covers the CRF. The second main finding (illustrated in Figure 4 of the paper) is that contrast invariance appears with the large stimulus because the tuning width decreases in the high contrast stimulus when the surround stimulus is added to the CRF stimulus. The tuning width for the low contrast conditions on average stays the same with or without the stimulus in the surround (although individual cells may be facilitated or suppressed). These results of (Liu, Hashemi-Nezhad, and Lyon 2015) are difficult to reconcile with classical results and, for me, indicate that a better measure of contrast-invariant orientation tuning is needed. This paper should definitely be read by anyone interested in this feature.

Stimulus and Methods

For the main experiment, they have two contrast conditions (low and high) that are defined for each neuron and two size conditions (CRF and CRF+ECS) that are defined for each contrast (and neuron). The smaller of the two sizes, the CRF / patch condition, is defined as the size that produces the largest response from the cell. The larger size, the CRF+ECS (extraclassical surround) condition, is defined by the size that produces the maximum suppression. The paper almost exclusively reports the half-width at half height (HWHH). This is half the width of the (fitted) orientation tuning curve at half of the maximum response of that tuning curve.
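
To make the HWHH measure concrete, here is a small sketch in plain Python. It assumes a Gaussian tuning curve with zero baseline, which is an assumption made only for this illustration (this note does not state which function the paper fits), and the names `gaussian_tuning` and `hwhh_numeric` and all parameter values are made up. For a Gaussian with standard deviation \(\sigma\), the half-width at half height is \(\sigma\sqrt{2\ln 2}\), and the sketch checks a numerically measured value against that expression.

```python
import math


def gaussian_tuning(theta, pref=0.0, sigma=20.0, r_max=1.0):
    """Response to orientation theta (degrees) for a zero-baseline Gaussian."""
    return r_max * math.exp(-((theta - pref) ** 2) / (2 * sigma ** 2))


def hwhh_numeric(curve, pref=0.0, step=0.01, limit=90.0):
    """Step away from the preferred orientation until the response
    first drops to half of the peak response; return that offset."""
    half = curve(pref) / 2.0
    n_steps = int(limit / step)
    for k in range(n_steps + 1):
        offset = k * step
        if curve(pref + offset) <= half:
            return offset
    return limit  # curve never fell to half height within the limit


sigma = 20.0
measured = hwhh_numeric(lambda t: gaussian_tuning(t, sigma=sigma))
analytic = sigma * math.sqrt(2 * math.log(2))
print(measured, analytic)
```

A narrower tuning curve (smaller \(\sigma\)) gives a smaller HWHH, which is the direction of change the paper reports when the tuning width decreases.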
Discussion

The paper states in the discussion that most other papers on this topic did not use the optimally sized stimulus, which is why they report different results. They do point out that Finn et al. (2007) did use a similar CRF condition, but reported different results, presumably because they used patch clamping. In Supplemental Fig. 3 of Finn et al., there are some extracellularly recorded neurons that reportedly are more consistent (I haven’t checked the results). Deep anesthesia is known to change the properties of the ECS of early visual neurons. It is unclear to me how much the results from anesthetized animals can be generalized to the normal awake state.

Liu, Yong-Jun, Maziar Hashemi-Nezhad, and David C. Lyon. 2015. “Contrast Invariance of Orientation Tuning in Cat Primary Visual Cortex Neurons Depends on Stimulus Size.” J Physiol 593 (19): 4485–98. https://doi.org/10.1113/JP271180.

Change konsole appearance during SSH
Everyone knows that feeling: you have many consoles open at the same time, connected via ssh to various servers, and you lose track of which is which. In this post I’m going to show a simple trick that changes the background whenever you ssh to a server and changes it back when you log out - well, at least if you are using KDE (or have konsole installed). For example, I have a virtual Linux system that I call “Puffin”. I’ve created an alias “ssh-puffin” to log in via ssh.

Image: Before ssh session

I have set up this alias to change the background of konsole:

Image: During ssh session

And then, after I log out, the konsole switches back to the local profile (and gives a warm and fuzzy welcome-back message).

Image: After ssh session

Step 1: Add konsole profile(s)

Create konsole profiles and corresponding color schemes for your local system (“Local”) and remote systems (“Puffin”). You only really need to create the color schemes, but I always create a separate profile with the same name. This is done by going to Settings of a konsole window and selecting “Manage Profiles”. You can access the color schemes by clicking edit (or new) and then clicking on Appearance. I created the Puffin background with GIMP using layers and an image from Wikimedia Commons by Richard Bartz. You can, of course, change the console appearance in other ways.

Step 2: Modify .bashrc

Add the following to your .bashrc file:

    alias resetcolors="konsoleprofile colors=Local"
    alias ssh-puffin="konsoleprofile colors=Puffin; ssh puffin; resetcolors; echo 'Welcome back'"'!'

If you have many remote servers, you may want to add your .bashrc file to github or the cloud™.

Step 3: Enjoy awesomeness

After reloading .bashrc, you can then log into the server using your alias.

Acknowledgements

I first figured out how to do this from a blog post by Abdussamad.

Backpropagation with shared weights in convolutional neural networks
The success of deep convolutional neural networks would not be possible without weight sharing - the same weights being applied to different neuronal connections. However, this property also makes them more complicated. This post aims to give an intuition of how backpropagation works with weight sharing. For a more well-rounded introduction to backpropagation of convolutional neural networks, see Andrew Gibiansky’s blog post. Backpropagation is used to calculate how the error in a neural network changes with respect to changes in a weight \(w\) in that neural network. In other words, it calculates:

\[\frac{\partial E}{\partial w}, \]

where \(E\) is the error and \(w\) is a weight. For traditional feed-forward neural networks, each connection between two neurons has its own weight and the calculation of the backpropagation is generally straightforward using the chain rule. For example, if you know how the error changes with respect to the node \(y_i\) (i.e. \(\frac{\partial E}{\partial y_i}\)), then calculating the contribution of the pre-synaptic weights of that node is simply:

\[\frac{\partial E}{\partial w}=\frac{\partial E}{\partial y_i}\frac{\partial y_i}{\partial w}. \]

This is complicated in convolutional neural networks because the weight \(w\) is used for multiple nodes (often, most or all nodes in the same layer).

Handling shared weights

In classical convolutional neural networks, shared weights are handled by summing together each instance in which the weight appears in the backpropagation derivation, instead of, for example, taking the average of each occurrence. So, if layer \(y^l\) is the layer “post-synaptic” to the weight \(w\) and we have calculated the effect of that layer on the error (\(\frac{\partial E}{\partial y^l}\)), then the gradient with respect to the weight is:

\[\frac{\partial E}{\partial w}=\sum_i\frac{\partial E}{\partial y^l_i} \frac{\partial y^l_i}{\partial w}, \]

where \(i\) specifies the node within layer \(l\). So why is summation the correct operation?
In essence, it is because when the paths from a weight (applied at different locations) merge, they do so with summation. For example, convolution involves summing the paths (in the dot-operation). Other operations such as max pooling and fully connected layers also involve summing the separate paths.

Simple example

Let’s take a very simple convolutional network. Let layer \(y^0\) be a 2D input layer and \([w_0, 0, 0]\) a kernel that is applied to this layer. For simplicity, let’s only have a single kernel. Then:

\[ x^1_{i}=w_0 y^0_{i} \]

An activation function is then applied to this result: \(y^1_i=h(x^1_{i})\). For the next convolutional layer, let’s say that the kernel \([w_1,w_2,w_3]\) is applied. Then:

\[ \begin{aligned} x^2_{i}&=\sum_{a=1}^3 w_a y^1_{i+a-1} \\ &= w_1 y^1_i + w_2 y^1_{i+1} + w_3 y^1_{i+2} \\ &= w_1 h\left(w_0 y^0_{i}\right) + w_2 h\left(w_0 y^0_{i+1}\right) + w_3 h\left(w_0 y^0_{i+2}\right). \\ \end{aligned} \]

and

\[ y^2_{i} = h(x^2_{i}). \]

So we are interested in \(\frac{\partial E}{\partial w_0}\). Let’s say that the error is only affected by the \(j\)th node of the output: \(y^2_{j}\). Then:

\[\frac{\partial E}{\partial w_0} = \frac{\partial E}{\partial y^2_{j}}\frac{\partial y^2_{j}}{\partial x^2_j}\frac{\partial x^2_{j}}{\partial w_0} \]

Assume that we have \(\frac{\partial E}{\partial y^2_{j}}\) and \(\frac{\partial y^2_{j}}{\partial x^2_j}\); then we only need to solve for \(\frac{\partial x^2_{j}}{\partial w_0}\).

\[ \begin{aligned} \frac{\partial x^2_{j}}{\partial w_0}&=\frac{\partial}{\partial w_0} \left(\sum_{a=1}^3 w_a y^1_{j+a-1}\right)\\ &= \sum_{a=1}^3 w_a \frac{\partial}{\partial w_0} \left( y^1_{j+a-1}\right)\\ &= \sum_{a=1}^3 w_a \frac{\partial}{\partial w_0} \left( h\left(w_0 y^0_{j+a-1}\right)\right)\\ &= w_1 \frac{\partial}{\partial w_0} h\left(w_0 y^0_{j}\right) + w_2 \frac{\partial}{\partial w_0} h\left(w_0 y^0_{j+1}\right) + w_3 \frac{\partial}{\partial w_0} h\left(w_0 y^0_{j+2}\right).
\\ \end{aligned} \]

Notice that each occurrence of \(w_0\) is summed separately, which is why backpropagation sums the gradients of shared weights in convolutional networks.
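
The summation in the simple example above can be checked numerically. The sketch below, in plain Python, computes \(\frac{\partial x^2_{j}}{\partial w_0}\) by summing one term per use of the shared weight \(w_0\) and compares it with a finite-difference estimate. The choice of tanh for \(h\), the input values, and the kernel values are all made up for illustration, and the input is treated as a 1D list to match the single index used in the equations.

```python
import math


def h(x):
    """Activation function; tanh is an arbitrary choice for this sketch."""
    return math.tanh(x)


def dh(x):
    """Derivative of tanh."""
    return 1.0 - math.tanh(x) ** 2


def x2(w0, w, y0, j):
    """x^2_j = w_1 h(w0 y^0_j) + w_2 h(w0 y^0_{j+1}) + w_3 h(w0 y^0_{j+2})."""
    return sum(w[a] * h(w0 * y0[j + a]) for a in range(3))


def analytic_grad(w0, w, y0, j):
    """d x^2_j / d w0: one chain-rule term per place where w0 is used."""
    return sum(w[a] * dh(w0 * y0[j + a]) * y0[j + a] for a in range(3))


def numeric_grad(w0, w, y0, j, eps=1e-6):
    """Central finite difference, perturbing the single shared weight w0."""
    return (x2(w0 + eps, w, y0, j) - x2(w0 - eps, w, y0, j)) / (2 * eps)


y0 = [0.5, -1.0, 2.0, 0.3]  # made-up input values y^0
w = [0.7, -0.2, 1.1]        # made-up kernel [w_1, w_2, w_3]
w0, j = 0.4, 0

summed = analytic_grad(w0, w, y0, j)
numeric = numeric_grad(w0, w, y0, j)
averaged = summed / 3       # what averaging the occurrences would give instead

print(summed, numeric, averaged)
```

The summed gradient agrees with the finite-difference estimate, while the averaged version is off by a factor of three, because nudging the one shared \(w_0\) changes all three paths at once.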

Passwordless ssh authentication!
On your local system, check to see if you have the following files:

    ~/.ssh/id_rsa
    ~/.ssh/id_rsa.pub

If not, type:

    ssh-keygen -t rsa

And follow the instructions. Note that ssh-agent can be used to securely save your passphrase. After you have generated your private and public keys, you want to give your remote system the public key:

    ssh-copy-id -i ~/.ssh/id_rsa.pub username@remote.system

After entering your password, you’re done!

Reference: http://www.debian-administration.org/articles/152