How do recent artificial neural networks, like the CLIP (Radford et al. 2021) and LLaVA (Liu et al. 2023) transformer networks,
compare to the brain? Is the attention in these networks similar to attention in the brain? In this article I look at these transformer architectures with an eye on their similarities to and differences from the mammalian brain and visual system.
I come to the conclusion that the processing that vision
transformers, CLIP, and LLaVA perform is analogous to a type of
computation called pre-attentive visual processing. This processing is
done in the initial feedforward visual responses to a stimulus before
any recurrence. Although a lot can be accomplished in a feedforward way,
studies have shown that feedforward pre-attentive processing in the
brain does have difficulty with:
Distinguishing the identity or characteristics of similar types of objects, especially when the objects are close together or cluttered, or when the objects are unnatural or artificial (VanRullen 2007).
Performing more complex tasks such as counting, maze solving, or curve tracing.
Perceiving objects that are more difficult to see, such as when the boundaries of the objects are difficult to perceive.
In contrast to feed-forward-only processing, one of the things that really stands out about the brain is the richness of the interactions between areas, which I will discuss in more detail in the next section.
Bidirectional Activity in
the Brain
In most current deep learning architectures, activity is propagated
in a single direction, for example, an image might be given as input to
a network and then propagated from layer to layer until you get to a
classification as the output.
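As a minimal sketch of this single-direction flow (the layer sizes below are arbitrary and chosen only for illustration, not taken from any particular model), a feedforward classifier in PyTorch might look like:

```python
import torch
import torch.nn as nn

# A minimal feedforward classifier: activity flows strictly from the
# input image toward the class scores, with no feedback connections.
feedforward_net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # "lower-level" features
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # "higher-level" features
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(32, 10),                            # classification output
)

image = torch.randn(1, 3, 64, 64)   # dummy input image
logits = feedforward_net(image)     # one forward sweep, no recurrence
```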
Figure 1: A simplified diagram showing
some of the feed-forward and feedback connections in the Macaque brain.
The areas that are earlier (or lower-level) are more white, while the areas that are later (or higher-level) are more blue.
The brain is much more interesting than these feedforward models. In
the visual system, a stimulus will propagate from lower to higher level
areas in a feedforward-like fashion, but then the higher level areas
will also influence the lower level areas as shown in Figure 1.
Some of this feedback is the conscious top-down attention that allows
us to allocate more resources to objects and features of interest and
allows us to disambiguate stimuli that are complex or ambiguous.
Another part of this feedback is automatic and allows higher level areas
to infuse the lower level areas with information that could not be known
in just the feedforward manner.
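As a toy illustration only (not a model of any particular brain area, and with all modules and sizes made up for the sketch), the difference from the purely feedforward example above is the loop in which the higher-level response is fed back to modulate the lower-level one:

```python
import torch
import torch.nn as nn

# Toy bidirectional loop: an initial feedforward sweep followed by a few
# steps in which the "higher-level area" feeds back onto the "lower-level area".
lower = nn.Linear(128, 64)    # lower-level area
higher = nn.Linear(64, 32)    # higher-level area
feedback = nn.Linear(32, 64)  # feedback connection (higher -> lower)

stimulus = torch.randn(1, 128)
low_act = torch.relu(lower(stimulus))        # initial feedforward response
for _ in range(3):                           # recurrent refinement steps
    high_act = torch.relu(higher(low_act))   # feedforward to the higher area
    # lower-level response is now modulated by information from the higher area
    low_act = torch.relu(lower(stimulus) + feedback(high_act))
```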
The conscious top-down attention is thought to support consciousness
of visual stimuli. Without conscious access to lower level areas that
encode borders and edges, we wouldn’t have such a spatially precise
perception of borders. Tasks such as mentally tracing a curve or solving
a maze would become impossible.
One example of the automatic unconscious feedback is border-ownership
which is seen in about half of the orientation-selective neurons in
visual area V2 (Zhou, Friedman, and von der Heydt 2000; Williford and von der
Heydt 2013). These neurons will encode local information in
about 40 ms and, as early as 10 ms after this initial response, will
start to incorporate global context to resolve occlusions, holding the information needed to know which objects are creating borders by occluding their backgrounds.
Another example of this unconscious feedback was shown in Poort et al. (2012) using images like the one in Figure 2. In the Macaque early visual cortex V1, neurons initially (within 50-75 ms of stimulus presentation) tend to encode only the local features within their receptive fields (e.g. the green square). However, after around 75 ms, they receive feedback from the higher level areas and tend to respond more strongly when their texture belongs to a figure, such as the texture-defined figure in Figure 2. This happens even when attention is drawn away from the figure; if the monkey is paying attention to the figure, the neurons tend to respond even more.
Figure 2: Image from (Poort et al. 2012). Shapes that are
defined only by texture, like the above, can be difficult to see in a
pure “feed-forward” manner. The biological visual system is able to
recognize shapes like these through the interaction of lower and higher
level areas, including top-down attention and subconscious
processes.
One way to look at this bidirectional interaction is that at any
given time, each neuron greedily uses all available predictive signals.
Even higher level areas can be informative.
Transformers
With all the talk about attention since the introduction of transformers (Vaswani et al. 2017), and with their ability to generate sentences one word at a time, you might be led to believe that transformers have recurrence. However, there is no “state” kept between the steps of the transformer except for the previous output. So at best the recurrence is very limited, and there is none of the bidirectionality that is ubiquitous in the brain. Transformers do allow for multi-headed attention, which could be interpreted as the ability to attend to multiple things simultaneously. In the original paper, the transformer used 8 attention heads. Image transformers can be seen as analogous to pre-attentive feedforward processing with some modifications, such as the multiple attention heads.
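To make the "no state between steps" point concrete, here is a minimal sketch of greedy autoregressive decoding with a hypothetical decoder-only transformer `model` (the function and the greedy decoding are my assumptions, not a specific implementation): the only information carried from one step to the next is the growing token sequence itself.

```python
import torch

def generate(model, prompt_ids, num_steps):
    """Greedy autoregressive decoding with a hypothetical decoder-only
    transformer `model` that maps token ids to logits. No hidden state
    survives between steps; the only "memory" is the token sequence."""
    tokens = prompt_ids                                  # shape: (1, seq_len)
    for _ in range(num_steps):
        logits = model(tokens)                           # full forward pass each step
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=1)  # append and repeat
    return tokens
```

(Key/value caching speeds this loop up but does not change the picture.) For the multi-headed attention itself, PyTorch's `nn.MultiheadAttention(embed_dim=512, num_heads=8)` mirrors the 8 heads used in the original paper.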
CLIP
Figure 3: Image from Radford et al. (2021) depicting how
CLIP is trained. \(I_1\) and \(T_1\) are the encodings of image 1 and the
corresponding caption. A contrastive learning loss is used to make the
\(I_i\) and \(T_j\) more similar when \(i=j\) and more dissimilar when \(i \neq j\). Weights are trained from
scratch.
CLIP was introduced by OpenAI in the Radford et al. (2021) paper “Learning Transferable
Visual Models from Natural Language Supervision”. The idea behind CLIP
is pretty simple and is shown in Figure 3. It takes a bunch of image and
caption pairs from the Internet and feeds the image to an image encoder and the text to a text encoder. It then uses a loss that brings the encoding of the image and the encoding of the text closer together when they come from the same pair, and otherwise increases the distance of
the encodings. This is what CLIP gives you: the ability to compare the
similarity between text and images. One way this can be used is for
zero-shot classification, as shown in Figure 4. CLIP does not, by
itself, generate text descriptions from images.
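A minimal sketch of that contrastive objective, assuming `image_features` and `text_features` are batches of paired encodings coming out of the two encoders (the fixed temperature here is a simplification; CLIP actually learns it):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired encodings.
    image_features, text_features: (batch, dim) tensors from the two
    encoders, where row i of each tensor comes from the same pair."""
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits = image_features @ text_features.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(logits.shape[0])                    # matching pairs lie on the diagonal
    loss_i = F.cross_entropy(logits, targets)       # images -> correct captions
    loss_t = F.cross_entropy(logits.t(), targets)   # captions -> correct images
    return (loss_i + loss_t) / 2
```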
The image encoder and text encoder are independent, meaning that
there is no way for task-driven modulation to influence the image
encoding. This means that the image encoder has to encode everything
that could be potentially relevant to the task. Typically the resolution
of the input image is pretty small, which helps prevent the computation
and memory requirements from exploding.
Figure 4: Image from Radford et al. (2021) depicting how
CLIP can be used for zero-shot classification. Text encodings are
generated for each class \(T_1\ldots
T_N\). The image is then encoded and the similarity is measured
with the generated text encodings. The most similar text encoding is the
chosen class.
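In code, the zero-shot recipe of Figure 4 reduces to encoding one prompt per class and picking the class whose text encoding is most similar to the image encoding. The sketch below uses stand-in `encode_image` and `encode_text` callables rather than a specific CLIP implementation:

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(encode_image, encode_text, image, class_names):
    """Zero-shot classification in the style of Figure 4.
    `encode_image` and `encode_text` are stand-ins for CLIP's encoders and
    are assumed to return (1, dim) and (N, dim) feature tensors."""
    prompts = [f"a photo of a {name}" for name in class_names]
    image_feat = F.normalize(encode_image(image), dim=-1)   # (1, dim)
    text_feats = F.normalize(encode_text(prompts), dim=-1)  # (N, dim)
    similarity = image_feat @ text_feats.t()                # (1, N) cosine similarities
    return class_names[similarity.argmax().item()]          # most similar text wins
```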
LLaVA
Figure 5: LLaVA architecture from Liu et al. (2023). \(\mathrm X_v\): image, \(\mathrm X_c\): caption, \(\mathrm X_q\): question derived from \(\mathrm X_c\) using GPT-4
Large Language and Vision Assistant (LLaVA) (Liu et al. 2023) is a large language and
vision architecture that builds on CLIP to add the ability
to describe and answer questions about images. This type of architecture
is interesting to me because it can attempt tasks that are similar to
those used in Neuroscience and Psychology.
LLaVA takes the vision transformer model ViT-L/14 that is trained by CLIP for image encoding (Figure 5). To convert the image encodings into tokens, the first paper uses a single linear projection matrix \(W\). The tokens calculated from the images \(\mathrm H_v\) and the tokens from the text instructions \(\mathrm H_q\) are provided as input. LLaVA can then generate the language response \(\mathrm X_a\) one token at a time, each time appending the response so far to the input for the next iteration.
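A rough sketch of that pipeline, with dimensions that are my assumptions rather than the paper's exact configuration (roughly, 1024-d ViT-L/14 patch features projected into a 4096-d language-model embedding space):

```python
import torch
import torch.nn as nn

# Hypothetical sizes: CLIP ViT-L/14 patch features (1024-d) are projected
# into the language model's embedding space (4096-d) by a single linear map W.
W = nn.Linear(1024, 4096)

def build_llava_input(image_features, instruction_embeddings):
    """image_features: (num_patches, 1024) from the CLIP image encoder.
    instruction_embeddings: (num_text_tokens, 4096) from the language model.
    Returns the concatenated sequence [H_v ; H_q] fed to the language model."""
    H_v = W(image_features)                         # image tokens H_v
    return torch.cat([H_v, instruction_embeddings], dim=0)
```

The response \(\mathrm X_a\) is then generated token by token over this combined sequence, exactly as in the autoregressive loop sketched in the Transformers section.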
I won’t go into the details of how LLaVA is trained, but it is interesting how they use ChatGPT to expand the caption (\(\mathrm X_c\) in Figure 5) into instructions (\(\mathrm X_q\)) and responses (used to train \(\mathrm X_a\)) about an image, and how they use bounding box information.
In version 1.5 of LLaVA (Liu et
al. 2024), some of the improvements they made include:
The linear projection matrix \(\mathrm W\) is replaced with a multilayer perceptron (see the sketch after this list).
The image resolution is increased by using an image encoder that takes images of size 336x336 pixels and splits the images into grids that are encoded separately.
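For the first change, the projector might look something like the following two-layer MLP (the sizes and the GELU activation are assumptions of this sketch, not taken from the paper's configuration files):

```python
import torch.nn as nn

# LLaVA-1.5-style projector: the single linear map W is replaced by a
# small MLP. Dimensions and activation are illustrative assumptions.
mlp_projector = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 4096),
)
```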
Task-driven attention in the brain is able to dynamically allocate resources to the object, location, or features of interest, which can allow processing of information that would otherwise be overwhelmed by clutter or other objects. In LLaVA, the image encoder is independent of the text instructions, so to be successful it needs to make sure that any potentially useful information is stored in the image tokens (\(\mathrm H_v\)).
Conclusion
Since LLaVA and CLIP lack bidirectional processing, the processing that they can do is limited. This is especially true for image processing, since it is done independently of the text instructions. Most convolutional neural networks also share these limitations. This leads me to my conjecture:
Conjecture: Most convolutional, vision transformer, and multimodal transformer networks are restricted to something like pre-attentive feedforward visual processing.
This is not necessarily a criticism as much as an insight that can be
informative. Feedforward processing can do a lot and is fast. However,
it is not as dynamic in what resources it can bring to bear, which can lead to informational bottlenecks in cluttered scenes, and it cannot encode enough information for complex tasks without an explosion in the size of the encodings.
There are some networks that are not limited to pre-attentive feedforward processing, but currently most of these architectures lag behind transformers. These include long short-term memory models (LSTMs) and, more recently, the Mamba architecture, which has several benefits over transformers (Gu and Dao 2024). Extended LSTMs (Beck et al. 2024; Alkin et al. 2024) have been proposed that help close some of the gap between LSTMs and transformers.
References
Alkin, Benedikt, Maximilian Beck, Korbinian Pöppel, Sepp Hochreiter, and
Johannes Brandstetter. 2024. “Vision-LSTM: xLSTM as Generic Vision
Backbone.” June 6, 2024. http://arxiv.org/abs/2406.04303.
Beck, Maximilian, Korbinian Pöppel, Markus Spanring, Andreas Auer,
Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes
Brandstetter, and Sepp Hochreiter. 2024. “xLSTM: Extended Long Short-Term
Memory.” May 7, 2024. http://arxiv.org/abs/2405.04517.
Gu, Albert, and Tri Dao. 2024. “Mamba: Linear-Time Sequence
Modeling with Selective State Spaces.” May
31, 2024. http://arxiv.org/abs/2312.00752.
Liu, Haotian, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2024.
“Improved Baselines with Visual Instruction Tuning.” In
Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition,
26296–306. https://openaccess.thecvf.com/content/CVPR2024/html/Liu_Improved_Baselines_with_Visual_Instruction_Tuning_CVPR_2024_paper.html.
Liu, Haotian, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023.
“Visual Instruction Tuning.” December 11,
2023. https://doi.org/10.48550/arXiv.2304.08485.
Poort, Jasper, Florian Raudies, Aurel Wannig, Victor A F Lamme, Heiko
Neumann, and Pieter R Roelfsema. 2012. “The Role of Attention in
Figure-Ground Segregation in Areas V1 and V4
of the Visual Cortex.” Neuron 75 (1): 143–56. https://doi.org/10.1016/j.neuron.2012.04.032.
Radford, Alec, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh,
Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, and Jack
Clark. 2021. “Learning Transferable Visual Models from Natural
Language Supervision.” In International Conference on Machine
Learning, 8748–63. PMLR. http://proceedings.mlr.press/v139/radford21a.
VanRullen, Rufin. 2007. “The Power of the Feed-Forward
Sweep.” Advances in Cognitive Psychology 3 (1-2): 167.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2864977/.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion
Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017.
“Attention Is All You Need.” Advances in Neural
Information Processing Systems 30. https://proceedings.neurips.cc/paper/7181-attention-is-all.
Williford, Jonathan R., and Rudiger von der Heydt. 2013.
“Border-Ownership Coding.” Scholarpedia 8 (10):
30040. http://scholarpedia.org/article/Border-ownership_coding.
Zhou, H., H. S. Friedman, and R. von der Heydt. 2000. “Coding of
Border Ownership in Monkey Visual Cortex.” The Journal of
Neuroscience 20 (17): 6594–6611.