All Things Attention: Bridging Different Perspectives on Attention

Four colorful panels of cartoons of robots and people looking at images of eyes

Generated by DALL-E

In conjunction with NeurIPS 22: December 2, 2022. Hybrid in-person and virtual.

Attention is a widely popular topic studied in many fields such as neuroscience, psychology, and machine learning. A better understanding and conceptualization of attention in both humans and machines has led to significant progress across fields. At the same time, attention is far from a clear or unified concept, with many definitions within and across multiple fields.

Cognitive scientists study how the brain flexibly controls its limited computational resources to accomplish its objectives. Inspired by cognitive attention, machine learning researchers introduce attention as an inductive bias in their models to improve performance or interpretability. Human-computer interaction designers monitor people’s attention during interactions to implicitly detect aspects of their mental states.

While the aforementioned research areas all consider attention, each formalizes and operationalizes it in different ways. Bridging this gap will facilitate:

Topics of Interest

The All Things Attention workshop aims to foster connections across disparate academic communities that conceptualize “attention”, such as Neuroscience, Psychology, Machine Learning, and Human-Computer Interaction. Workshop topics of interest include:

Relationships between biological and artificial attention

Attention for reinforcement learning and decision making

Attention mechanisms for continual / lifelong learning

Attention for interpretation and explanation

Attention in human-computer interaction

Attention mechanisms in Deep Neural Network (DNN) architectures

Ways to Participate

Recording of the event is available here.


Ask questions on slido: Use this link or the embedded page here:

RocketChat and NeurIPS workshop page

You can participate in live discussions during the workshop here. Invited speakers will also be encouraged to answer questions offline using the same link. Please note you need to be registered for the workshop to access RocketChat.

Virtual poster session and hangouts

The virtual poster session will be held on the Discord server — see the NeurIPS page for a link. It’ll be open for the whole day, so feel free to jump in just to chat!


Time in CST Event
9:00 AM - 11:00 AM Talks Session I
  9:00 AM - 9:05 AM Workshop Intro
  9:05 AM - 9:25 AM Ida Momennejad
  9:25 AM - 9:45 AM James Whittington
  9:45 AM - 10:05 AM Henny Admoni
  10:05 AM - 10:25 AM Tobias Gerstenberg
  10:25 AM - 11:00 AM Spotlight Talks:
    • Foundations of Attention Mechanisms in Deep Neural Network Architectures
    • Is Attention Interpretation? A Quantitative Assessment On Sets
11:00 AM - 12:00 PM In-Person Panel Discussion
Panelists: Megan deBettencourt, Tobias Gerstenberg, Erin Grant, Ida Momennejad, Ramakrishna Vedantam, James Whittington, Cyril Zhang
12:00 PM - 1:00 PM Lunch and Virtual Social Event
1:00 PM - 2:00 PM Coffee Break / Poster Session
2:00 PM - 4:00 PM Talks Session II
  2:00 PM - 2:20 PM Shalini De Mello
  2:20 PM - 2:40 PM Pieter Roelfsema
  2:40 PM - 3:00 PM Erin Grant
  3:00 PM - 3:20 PM Vidhya Navalpakkam
  3:20 PM - 4:00 PM Spotlight Talks:
    • Wide Attention Is The Way Forward For Transformers?
    • Fine-tuning hierarchical circuits through learned stochastic co-modulation
    • Hierarchical Abstraction for Combinatorial Generalization in Object Rearrangement
4:00 PM - 5:00 PM Coffee Break / Poster Session
5:00 PM - 6:00 PM Virtual Panel Discussion
Panelists: Henny Admoni, David Ha, Brian Kingsbury, John Langford, Shalini De Mello, Vidhya Navalpakkam, Ashish Vaswani

Invited Speakers

Ida Momennejad
Principal Researcher, Microsoft Research

Attention in Task-sets, Planning, and the Prefrontal Cortex

What we pay attention to depends on the context and the task at hand. On the one hand, the prefrontal cortex can modulate how to direct attention outward to the external world. On the other hand, attention to internal states enables metacognition and configuration of internal states using repertoires of memories and skills. I will first discuss ongoing work in which, inspired by the role of attention in affordances and task-sets, we analyze large-scale gameplay data from the Xbox 3D game Bleeding Edge in an interpretable way. I will briefly mention ongoing directions including decoding of plans during chess based on eye-tracking. I will conclude with how future models of multi-scale predictive representations could include prefrontal cortical modulation during planning and task performance.

James Whittington
Postdoc, University of Oxford

Relating Transformers to Models and Neural Representations of the Hippocampal Formation

Many deep neural network architectures loosely based on brain networks have recently been shown to replicate neural firing patterns observed in the brain. One of the most exciting and promising novel architectures, the Transformer neural network, was developed without the brain in mind. In this work, we show that transformers, when equipped with recurrent position encodings, replicate the precisely tuned spatial representations of the hippocampal formation; most notably place and grid cells. Furthermore, we show that this result is no surprise since it is closely related to current hippocampal models from neuroscience. We additionally show the transformer version offers dramatic performance gains over the neuroscience version. This work continues to bind computations of artificial and brain networks, offers a novel understanding of the hippocampal-cortical interaction, and suggests how wider cortical areas may perform complex tasks beyond current neuroscience models such as language comprehension.

Henny Admoni
A. Nico Habermann Assistant Professor, Carnegie Mellon University

Eye Gaze in Human-Robot Collaboration

In robotics, human-robot collaboration works best when robots are responsive to their human partners’ mental states. Human eye gaze has been used as a proxy for one such mental state: attention. While eye gaze can be a useful signal, for example enabling intent prediction, it is also a noisy one. Gaze serves several functions beyond attention, and thus recognizing what people are attending to from their eye gaze is a complex task. In this talk, I will discuss our research on modeling eye gaze to understand human attention in collaborative tasks such as shared manipulation and assisted driving.

Tobias Gerstenberg
Assistant Professor of Cognitive Psychology, Stanford University

Attending to What's Not There

When people make sense of the world, they don’t only pay attention to what’s actually happening. Their mind also takes them to counterfactual worlds of what could have happened. In this talk, I will illustrate how we can use eye-tracking to uncover the human mind’s forays into the imaginary. I will show that when people make causal judgments about physical interactions, they don’t just look at what actually happens. They mentally simulate what would have happened in relevant counterfactual situations to assess whether the cause made a difference. And when people try to figure out what happened in the past, they mentally simulate the different scenarios that could have led to the outcome. Together these studies illustrate how attention is not only driven by what’s out there in the world, but also by what’s hidden inside the mind.

Shalini De Mello
Principal Research Scientist, NVIDIA

Exploiting Human Interactions to Learn Human Attention

Unconstrained eye gaze estimation using ordinary webcams in smartphones and tablets is immensely useful for many applications. However, current eye gaze estimators are limited in their ability to generalize to a wide range of unconstrained conditions, including head poses, eye gaze angles, and lighting conditions. This is mainly due to the lack of gaze training data collected in in-the-wild conditions. Notably, eye gaze is a natural form of human communication while humans interact with each other. Visual data (videos or images) containing human interaction are also abundantly available on the internet and are constantly growing as people upload more. Could we leverage visual data containing human interaction to learn unconstrained gaze estimators? In this talk we will describe our foray into addressing this challenging problem. Our findings point to the great potential of human interaction data as a low-cost and ubiquitously available source of training data for unconstrained gaze estimators. By lessening the burden of specialized data collection and annotation, we hope to foster greater real-world adoption and proliferation of gaze estimation technology in end-user devices.

Pieter Roelfsema
Department Head, Netherlands Institute for Neuroscience

BrainProp: How Attentional Processes in the Brain Solve the Credit Assignment Problem

Humans and many other animals have an enormous capacity to learn about sensory stimuli and to master new skills. Many of the mechanisms that enable us to learn remain to be understood. One of the greatest challenges of systems neuroscience is to explain how synaptic connections change to support maximally adaptive behaviour. We will provide an overview of factors that determine the change in the strength of synapses. Specifically, we will discuss the influence of attention, neuromodulators and feedback connections in synaptic plasticity and suggest a specific framework, called BrainProp, in which these factors interact to improve the functioning of the entire network.

Much recent work focuses on learning in the brain using presumed biologically plausible variants of supervised learning algorithms. However, the biological plausibility of these approaches is limited, because there is no teacher in the motor cortex that instructs the motor neurons. Instead, learning in the brain usually depends on reward and punishment. BrainProp is a biologically plausible reinforcement learning scheme for deep networks with any number of layers. The network chooses an action by selecting a unit in the output layer and uses feedback connections to assign credit to the units in lower layers that are responsible for this action. After the choice, the network receives reinforcement so that there is no need for a teacher. We showed how BrainProp is mathematically equivalent to error backpropagation, for one output unit at a time (Pozzi et al., 2020). We illustrate learning of classical and hard image-classification benchmarks (MNIST, CIFAR10, CIFAR100 and Tiny ImageNet) by deep networks. BrainProp achieves an accuracy that is equivalent to that of standard error-backpropagation, and better than other state-of-the-art biologically inspired learning schemes. Additionally, the trial-and-error nature of learning is associated with limited additional training time so that BrainProp is a factor of 1-3.5 times slower. These results provide new insights into how deep learning may be implemented in the brain.
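
The trial-by-trial scheme described above (select one output unit, receive reinforcement, and use feedback from the selected unit to assign credit) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the BrainProp implementation from Pozzi et al. (2020): the two-layer ReLU network, the epsilon-greedy selection rule, the 0/1 reward, and all function and parameter names are hypothetical choices made for the example.

```python
import random


def forward(W1, W2, x):
    """Two-layer network: ReLU hidden layer, one linear output unit per action."""
    z = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    h = [max(0.0, zi) for zi in z]
    q = [sum(w * hi for w, hi in zip(row, h)) for row in W2]
    return z, h, q


def brainprop_like_step(W1, W2, x, target, lr=0.1, epsilon=0.1, rng=random):
    """Run one trial and update the weights in place (illustrative sketch only)."""
    z, h, q = forward(W1, W2, x)

    # Action selection: mostly greedy, occasionally exploratory.
    # Epsilon-greedy is an assumption made for this example.
    if rng.random() < epsilon:
        s = rng.randrange(len(q))
    else:
        s = max(range(len(q)), key=lambda k: q[k])

    # Reinforcement instead of a teacher: 1 if the chosen unit was correct, else 0.
    reward = 1.0 if s == target else 0.0
    delta = reward - q[s]  # prediction error of the selected unit only

    # Feedback from the selected unit marks the hidden units responsible for it.
    feedback = [W2[s][j] * (1.0 if z[j] > 0 else 0.0) for j in range(len(h))]

    for j in range(len(h)):
        W2[s][j] += lr * delta * h[j]  # only the selected unit's weights change
        for i in range(len(x)):
            W1[j][i] += lr * delta * feedback[j] * x[i]
    return s, reward, delta
```

In this sketch only the selected unit's prediction error is propagated, so each update is a gradient step on the squared error of that single output. That is one way to read the abstract's statement about equivalence to backpropagation for one output unit at a time.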

Erin Grant
Senior Research Fellow, University College London

Attention as Interpretable Information Processing in Machine Learning Systems

Attention in psychology and neuroscience conceptualizes how the human mind prioritizes information as a result of limited resources. Machine learning systems do not necessarily share the same limits, but implementations of attention have nevertheless proven useful in machine learning across a broad set of domains. Why is this so? I will focus on one aspect: interpretability, which is an ongoing challenge for machine learning systems. I will discuss two different implementations of attention in machine learning that tie closely to conceptualizations of attention in two domains of psychological research. Using these case studies as a starting point, I will discuss the broader strengths and drawbacks of using attention to constrain and interpret how machine learning systems process information. I will end with a problem statement highlighting the need to move away from localized notions to a global view of how attention-like mechanisms modulate information processing in artificial systems.

Vidhya Navalpakkam
Principal Scientist, Google Research

Accelerating Human Attention Research via ML Applied to Smartphones

Attention and eye movements are thought to be a window to the human mind, and have been extensively studied across Neuroscience, Psychology and HCI. However, progress in this area has been severely limited as the underlying methodology relies on specialized hardware that is expensive (up to $30,000) and hard to scale. In this talk, I will present our recent work from Google, which shows that ML applied to smartphone selfie cameras can enable accurate gaze estimation, comparable to state-of-the-art hardware-based devices, at 1/100th the cost and without any additional hardware. Via extensive experiments, we show that our smartphone gaze tech can successfully replicate key findings from prior hardware-based eye movement research in Neuroscience and Psychology, across a variety of tasks including traditional oculomotor tasks, saliency analyses on natural images and reading comprehension. We also show that smartphone gaze could enable applications in improved health/wellness, for example, as a potential digital biomarker for detecting mental fatigue. These results show that smartphone-based attention has the potential to unlock advances by scaling eye movement research, and enabling new applications for improved health, wellness and accessibility, such as gaze-based interaction for patients with ALS/stroke who cannot otherwise interact with devices.


Ida Momennejad
Principal Researcher, Microsoft Research
James Whittington
Postdoc, University of Oxford
Henny Admoni
A. Nico Habermann Assistant Professor, Carnegie Mellon University
Tobias Gerstenberg
Assistant Professor of Cognitive Psychology, Stanford University
Shalini De Mello
Principal Research Scientist, NVIDIA
Erin Grant
Senior Research Fellow, University College London
Vidhya Navalpakkam
Principal Scientist, Google Research
Megan deBettencourt
Postdoc, University of Chicago
David Ha
Head of Strategy, Stability AI
Ramakrishna Vedantam
Research Scientist, Facebook AI Research (FAIR)
Cyril Zhang
Senior Researcher, Microsoft Research
Ashish Vaswani
Chief Scientist and Co-Founder, Adept AI Labs
Brian Kingsbury
Distinguished Research Scientist and Manager, IBM Research
John Langford
Partner Research Manager, Microsoft Research

Accepted Papers

Oral Presentations

Fine-tuning hierarchical circuits through learned stochastic co-modulation Caroline Haimerl, Eero P Simoncelli, Cristina Savin
Targeted stochastic co-modulation in the brain introduces a label of task-relevant information that can help fine-tune a hierarchical model of the visual system for a new task
Attentional gating is a core mechanism supporting behavioral flexibility, but its biological implementation remains uncertain. Gain modulation of neural responses is likely to play a key role, but simply boosting relevant neural responses can be insufficient for improving behavioral outputs, especially in hierarchical circuits. Here we propose a variation of attentional gating that relies on stochastic gain modulation as a dedicated indicator of task relevance, which guides task-specific readout adaptation. We show that targeted stochastic modulation can be effectively learned and used to fine-tune hierarchical architectures, without reorganization of the underlying circuits. Simulations of such networks demonstrate improvements in learning efficiency and performance in novel tasks, relative to traditional attentional mechanisms based on deterministic gain increases. The effectiveness of this approach relies on the availability of representational bottlenecks in which the task relevant information is localized in small subpopulations of neurons. Overall, this work provides a new mechanism for constructing intelligent systems that can flexibly and robustly adapt to changes in task structure.
Foundations of Attention Mechanisms in Deep Neural Network Architectures Pierre Baldi, Roman Vershynin
We classify all attention mechanisms, identify the most important one, and prove several theorems about their capacity.
We consider the foundations of attention mechanisms in deep neural network architectures and present three main results. First, we provide a systematic taxonomy of all possible attention mechanisms within, or as extensions of, the McCulloch and Pitts standard model into 18 classes depending on the origin type of the attention signal, the target type of the attention signal, and whether the interaction type is additive or multiplicative. Second, using this taxonomy, we identify three key attention mechanisms: output gating, synaptic gating, and multiplexing. Output gating and synaptic gating are extensions of the standard model and all current attention-based architectures, including transformers, use either output gating or synaptic gating, or a combination of both. Third, we develop a theory of attention capacity and derive mathematical results about the capacity of basic attention networks. For example, the output gating of a linear threshold gate of $n$ variables by another linear threshold gate of the same $n$ variables has capacity $2n^2 (1+o(1))$. Perhaps surprisingly, multiplexing attention is used in the proofs of these results. Synaptic and output gating provide computationally efficient extensions of the standard model allowing for sparse quadratic activation functions. They can also be viewed as primitives enabling the concise collapsing of multiple layers of processing in the standard model.
Hierarchical Abstraction for Combinatorial Generalization in Object Rearrangement Michael Chang, Alyssa Li Dayan, Franziska Meier, Thomas L. Griffiths, Sergey Levine, Amy Zhang
We demonstrate how to generalize over a combinatorially large space of rearrangement tasks from only pixel observations by constructing from video demonstrations a factorized transition graph over entity state transitions that we use for control.
Object rearrangement is a challenge for embodied agents because solving these tasks requires generalizing across a combinatorially large set of underlying entities that take the value of object states. Worse, these entities are often unknown and must be inferred from sensory percepts. We present a hierarchical abstraction approach to uncover these underlying entities and achieve combinatorial generalization from unstructured inputs. By constructing a factorized transition graph over clusters of object representations inferred from pixels, we show how to learn a correspondence between intervening on states of entities in the agent's model and acting on objects in the environment. We use this correspondence to develop a method for control that generalizes to different numbers and configurations of objects, which outperforms current offline deep RL methods when evaluated on a set of simulated rearrangement and stacking tasks.
Is Attention Interpretation? A Quantitative Assessment On Sets Jonathan Haab, Nicolas Deutschmann, Maria Rodriguez Martinez
We test the interpretability of attention weights by designing Multiple Instance Learning synthetic datasets with ground-truth instance-level labels.
The debate around the interpretability of attention mechanisms is centered on whether attention scores can be used as a proxy for the relative amounts of signal carried by sub-components of data. We propose to study the interpretability of attention in the context of set machine learning, where each data point is composed of an unordered collection of instances with a global label. For classical multiple-instance-learning problems and simple extensions, there is a well-defined “importance” ground truth that can be leveraged to cast interpretation as a binary classification problem, which we can quantitatively evaluate. By building synthetic datasets over several data modalities, we perform a systematic assessment of attention-based interpretations. We find that attention distributions are indeed often reflective of the relative importance of individual instances, but that silent failures happen where a model will have high classification performance but attention patterns that do not align with expectations. Based on these observations, we propose to use ensembling to minimize the risk of misleading attention-based explanations.
Wide Attention Is The Way Forward For Transformers? Jason Ross Brown, Yiren Zhao, Ilia Shumailov, Robert D. Mullins
Widening the attention layer in a Transformer and only using a single layer is surprisingly effective, with a number of advantages.
The Transformer is an extremely powerful and prominent deep learning architecture. In this work, we challenge the commonly held belief in deep learning that going deeper is better, and show an alternative design approach: building wider attention Transformers. We demonstrate that wide single layer Transformer models can compete with or outperform deeper ones in a variety of Natural Language Processing (NLP) tasks when both are trained from scratch. The impact of changing the model aspect ratio on Transformers is then studied systematically. This ratio balances the number of layers and the number of attention heads per layer while keeping the total number of attention heads and all other hyperparameters constant. On average, across 4 NLP tasks and 10 attention types, single layer wide models perform 0.3% better than their deep counterparts. We show an in-depth evaluation and demonstrate how wide models require a far smaller memory footprint and can run faster on commodity hardware. In addition, these wider models are also more interpretable. For example, a single layer Transformer on the IMDb byte level text classification has 3.1x faster inference latency on a CPU than its equally accurate deeper counterpart, and is half the size. We therefore put forward wider and shallower models as a viable and desirable alternative for small models on NLP tasks, and as an important area of research for domains beyond this.
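
The output gating and synaptic gating named in the "Foundations of Attention Mechanisms in Deep Neural Network Architectures" abstract above can be illustrated with a small Python sketch. It assumes gates that output 0 or 1, reads output gating as multiplying one gate's output by another's, and reads synaptic gating as multiplying one of the gated unit's weights by the gating unit's output. The function names and example weights are hypothetical and not taken from the paper.

```python
def threshold_gate(weights, bias, x):
    """Linear threshold gate: 1 if w . x + b > 0, else 0."""
    s = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if s > 0 else 0


def output_gating(gated, gating, x):
    """Multiply the gated unit's output by the gating unit's output."""
    (w_f, b_f), (w_g, b_g) = gated, gating
    return threshold_gate(w_f, b_f, x) * threshold_gate(w_g, b_g, x)


def synaptic_gating(gated, gating, x, synapse):
    """Multiply one synaptic weight of the gated unit by the gating unit's output."""
    (w_f, b_f), (w_g, b_g) = gated, gating
    g = threshold_gate(w_g, b_g, x)
    w_mod = list(w_f)
    w_mod[synapse] = w_f[synapse] * g
    return threshold_gate(w_mod, b_f, x)


# Both gates read the same two inputs, as in the capacity example in the abstract.
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
gated = ([1.0, 1.0], -0.5)    # fires when x0 OR x1 is on
gating = ([1.0, -1.0], -0.5)  # fires only when x0 is on and x1 is off

print([output_gating(gated, gating, x) for x in inputs])               # [0, 0, 1, 0]
print([synaptic_gating(gated, gating, x, synapse=1) for x in inputs])  # [0, 0, 1, 1]
```

In this toy setting the two mechanisms compute different Boolean functions from the same pair of gates. Output gating silences the gated unit entirely, while synaptic gating only removes the contribution of one input.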

Poster Presentations

Attention as inference with third-order interactions Yicheng Fei, Xaq Pitkow
In neuroscience, attention has been associated operationally with enhanced processing of certain sensory inputs depending on external or internal contexts such as cueing, salience, or mental states. In machine learning, attention usually means a multiplicative mechanism whereby the weights in a weighted summation of an input vector are calculated from the input itself or some other context vector. In both scenarios, attention can be conceptualized as a gating mechanism. In this paper, we argue that three-way interactions serve as a normative way to define a gating mechanism in generative probabilistic graphical models. Going a step beyond pairwise interactions enables much greater computational efficiency, much as a transistor expands the set of possible digital computations. Models with three-way interactions are also easier to scale up and thus to implement biologically. As an example application, we show that a graphical model with three-way interactions provides a normative explanation for divisive normalization in macaque primary visual cortex, an operation adopted widely throughout the cortex to reduce redundancy, save energy, and improve computation.
Attention for Compositional Modularity Oleksiy Ostapenko, Pau Rodriguez, Alexandre Lacoste, Laurent Charlin
In this work we studied different attention-based module selection approaches for computational modularity.
Modularity and compositionality are promising inductive biases for addressing longstanding problems in machine learning such as better systematic generalization, as well as better transfer and lower forgetting in the context of continual learning. Here we study how attention-based module selection can help achieve compositional modularity – i.e. decomposition of tasks into meaningful sub-tasks which are tackled by independent architectural entities that we call modules. These sub-tasks must be reusable and the system should be able to learn them without additional supervision. We design a simple experimental setup in which the model is trained to solve mathematical equations with multiple math operations applied sequentially. We study different attention-based module selection strategies, inspired by the principles introduced in the recent literature. We evaluate the method’s ability to learn modules that can recover the underlying sub-tasks (operations) used for data generation, as well as the ability to generalize compositionally. We find that meaningful module selection (i.e. routing) is the key to compositional generalization. Further, without access to the privileged information about which part of the input should be used for module selection, the routing component performs poorly for samples that are compositionally out of training distribution. We find that the main reason for this lies in the routing component, since many of the tested methods perform well OOD if we report the performance of the best performing path at test time. Additionally, we study the role of the number of primitives, the number of training points and bottlenecks for modular specialization.
Bounded logit attention: Learning to explain image classifiers Thomas Baumhauer, Djordje Slijepcevic, Matthias Zeppelzauer
We present a trainable self-explanation module for convolutional neural networks based on an attention mechanism using a novel type of activation function.
Explainable artificial intelligence is the attempt to elucidate the workings of systems too complex to be directly accessible to human cognition through suitable side information referred to as “explanations”. We present a trainable explanation module for convolutional image classifiers we call bounded logit attention (BLA). The BLA module learns to select a subset of the convolutional feature map for each input instance, which then serves as an explanation for the classifier’s prediction. BLA overcomes several limitations of the instancewise feature selection method “learning to explain” (L2X) introduced by Chen et al. (2018): 1) BLA scales to real-world sized image classification problems, and 2) BLA offers a canonical way to learn explanations of variable size. Due to its modularity BLA lends itself to transfer learning setups and can also be employed as a post-hoc add-on to trained classifiers. Beyond explainability, BLA may serve as a general purpose method for differentiable approximation of subset selection. In a user study we find that BLA explanations are preferred over explanations generated by the popular (Grad-)CAM method (Zhou et al., 2016; Selvaraju et al., 2017).
Faster Attention Is What You Need: A Fast Self-Attention Neural Network Backbone Architecture for the Edge via Double-Condensing Attention Condensers Alexander Wong, Mohammad Javad Shafiee, Saad Abbasi, Saeejith Nair, Mahmoud Famouri
With the growing adoption of deep learning for on-device TinyML applications, there has been an ever-increasing demand for more efficient neural network backbones optimized for the edge. Recently, the introduction of attention condenser networks has resulted in low-footprint, highly-efficient, self-attention neural networks that strike a strong balance between accuracy and speed. In this study, we introduce a new faster attention condenser design called double-condensing attention condensers that enable more condensed feature embedding. We further employ a machine-driven design exploration strategy that imposes best practices design constraints for greater efficiency and robustness to produce the macro-micro architecture constructs of the backbone. The resulting backbone (which we name AttendNeXt) achieves significantly higher inference throughput on an embedded ARM processor when compared to several other state-of-the-art efficient backbones ($>10\times$ faster than FB-Net C at higher accuracy and speed and $>10\times$ faster than MobileOne-S1 at smaller size) while having a small model size ($>1.37\times$ smaller than MobileNetv3-L at higher accuracy and speed) and strong accuracy (1.1% higher top-1 accuracy than MobileViT XS on ImageNet at higher speed). These promising results demonstrate that exploring different efficient architecture designs and self-attention mechanisms can lead to interesting new building blocks for TinyML applications.
First De-Trend then Attend: Rethinking Attention for Time-Series Forecasting Xiyuan Zhang, Xiaoyong Jin, Karthick Gopalswamy, Gaurav Gupta, Youngsuk Park, Xingjian Shi, Hao Wang, Danielle C. Maddix, Bernie Wang
We theoretically and empirically analyze relationships between variants of attention models in time-series forecasting, and propose a decomposition-based hybrid method that achieves better performance than current attention models.
Transformer-based models have gained considerable popularity and demonstrated promising results in long-term time-series forecasting in recent years. In addition to learning attention in the time domain, recent works also explore learning attention in frequency domains (e.g., Fourier domain, wavelet domain), given that seasonal patterns can be better captured in these domains. In this work, we seek to understand the relationships between attention models in different time and frequency domains. Theoretically, we show that attention models in different domains are equivalent under linear conditions (i.e., linear kernel to attention scores). Empirically, we analyze how attention models of different domains show different behaviors through various synthetic experiments with seasonality, trend and noise, with emphasis on the role of the softmax operation therein. Both these theoretical and empirical analyses motivate us to propose a new method: TDformer (Trend Decomposition Transformer), that first applies seasonal-trend decomposition, and then additively combines an MLP which predicts the trend component with Fourier attention which predicts the seasonal component to obtain the final prediction. Extensive experiments on benchmark time-series forecasting datasets demonstrate that TDformer achieves state-of-the-art performance against existing attention-based models.
FuzzyNet: A Fuzzy Attention Module for Polyp Segmentation Krushi Bharatbhai Patel, Fengjun Li, Guanghui Wang
A Fuzzy attention module to focus more on hard pixels lying around the boundary region of the polyp.
Polyp segmentation is essential for accelerating the diagnosis of colon cancer. However, it is challenging because of the diverse color, texture, and varying lighting effects of the polyps as well as the subtle difference between the polyp and its surrounding area. To further increase the performance of polyp segmentation, we propose to focus more on the problematic pixels that are harder to predict. To this end, we propose a novel attention module named Fuzzy Attention to focus more on the difficult pixels. Our attention module generates a high attention score for fuzzy pixels usually located near the boundary region. This module can be embedded in any convolutional neural network-based backbone network. We embed our module with various backbone networks: Res2Net, ConvNext and Pyramid Vision Transformer (PVT), and evaluate the models on five polyp segmentation datasets: Kvasir, CVC-300, CVC-ColonDB, CVC-ClinicDB, and ETIS. Our attention module with Res2Net as the backbone network outperforms the reverse attention-based PraNet by a significant amount on all datasets. In addition, our module with PVT as the backbone network achieves state-of-the-art accuracy of 0.937, 0.811, and 0.791 on the CVC-ClinicDB, CVC-ColonDB, and ETIS, respectively, outperforming the latest SA-Net, TransFuse and Polyp-PVT.
Graph Attention for Spatial Prediction Corban Rivera, Ryan W. Gardner
We introduce an allocentric graph attention approach for spatial reasoning and object localization.
Imbuing robots with human-level intelligence is a longstanding goal of AI research. A critical aspect of human-level intelligence is spatial reasoning. Spatial reasoning requires a robot to reason about relationships among objects in an environment to estimate the positions of unseen objects. In this work, we introduce a novel graph attention approach for predicting the locations of query objects in partially observable environments. We found that our approach achieved state-of-the-art results on object location prediction tasks. We then evaluated our approach on never-before-seen objects and observed zero-shot generalization in estimating the positions of new object types.
Improving cross-modal attention via object detection Yongil Kim, Yerin Hwang, Seunghyun Yoon, Hyeongu Yun, Kyomin Jung
Cross-modal attention is widely used in multimodal learning to fuse information from two modalities. However, most existing models only assimilate cross-modal attention indirectly by relying on end-to-end learning and do not directly improve the attention mechanisms. In this paper, we propose a methodology for directly enhancing cross-modal attention by utilizing object-detection models for vision-and-language tasks that deal with image and text information. We use the masks of the detected objects obtained by the detection model as pseudo labels, and we add a loss between the attention map of the multimodal learning model and the pseudo labels. The proposed methodology drastically improves the performance of the baseline model across all performance metrics on various popular datasets for the image-captioning task. Moreover, our highly scalable methodology can be applied to any vision-and-language multimodal task.
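The idea of adding a loss between an attention map and a detector-derived pseudo label can be sketched as follows. The abstract does not specify the loss, so the per-pixel binary cross-entropy used here is an assumption, as are the function name, the toy 4x4 maps, and the treatment of attention values as per-pixel scores in [0, 1].

```python
import numpy as np

def attention_supervision_loss(attn_map, object_mask, eps=1e-7):
    # Per-pixel binary cross-entropy between an attention map with values
    # in [0, 1] and a binary object mask used as the pseudo label.
    # (Assumed loss; the paper may use a different one.)
    attn = np.clip(attn_map, eps, 1.0 - eps)
    bce = -(object_mask * np.log(attn)
            + (1.0 - object_mask) * np.log(1.0 - attn))
    return float(bce.mean())

# Toy pseudo label: a detected object occupying the center of a 4x4 grid.
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0

# One attention map that agrees with the mask and one that does not.
aligned = np.where(mask == 1.0, 0.9, 0.1)
misaligned = 1.0 - aligned

loss_aligned = attention_supervision_loss(aligned, mask)
loss_misaligned = attention_supervision_loss(misaligned, mask)
print(loss_aligned, loss_misaligned)
```

In training, a term like this would typically be added to the task loss with a weighting coefficient, so that attention is pushed toward detected objects while the captioning objective is still optimized.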
Quantifying attention via dwell time and engagement in a social media browsing environment Ziv Epstein, Hause Lin, Gordon Pennycook, David Rand
We propose a two-stage model of attention for social media environments that disentangles engagement and dwell, and show that attention operates differently in these two stages via a dissociation.
Modern computational systems have an unprecedented ability to detect, leverage and influence human attention. Prior work identified user engagement and dwell time as two key metrics of attention in digital environments, but these metrics have yet to be integrated into a unified model that can advance the theory and practice of digital attention. We draw on work from cognitive science, digital advertising, and AI to propose a two-stage model of attention for social media environments that disentangles engagement and dwell. In an online experiment, we show that attention operates differently in these two stages and find clear evidence of dissociation: when dwelling on posts (Stage 1), users attend more to sensational than credible content, but when deciding whether to engage with content (Stage 2), users attend more to credible than sensational content. These findings have implications for the design and development of computational systems that measure and model human attention, such as newsfeed algorithms on social media.
Revisiting Attention Weights as Explanations from an Information Theoretic Perspective Bingyang Wen, Koduvayur Subbalakshmi, Fan Yang
This work evaluates the interpretability of attention weights; the results show that attention weights have the potential to serve as model explanations.
Attention mechanisms have recently demonstrated impressive performance on a range of NLP tasks, and attention scores are often used as a proxy for model explainability. However, there is a debate on whether attention weights can, in fact, be used to identify the most important inputs to a model. We approach this question from an information theoretic perspective by measuring the mutual information between the model output and the hidden states. From extensive experiments, we draw the following conclusions: (i) Additive and Deep attention mechanisms are likely to be better at preserving the information between the hidden states and the model output (compared to Scaled Dot-product); (ii) ablation studies indicate that Additive attention can actively learn to explain the importance of its input hidden representations; (iii) when attention values are nearly the same, the rank order of attention values is not consistent with the rank order of the mutual information; (iv) using Gumbel-Softmax with a temperature lower than one tends to produce a more skewed attention score distribution compared to softmax and hence is a better choice for explainable design; (v) some building blocks are better at preserving the correlation between the ordered list of mutual information and the order of attention weights (e.g., the combination of a BiLSTM encoder and Additive attention). Our findings indicate that attention mechanisms do have the potential to function as a shortcut to model explanations when they are carefully combined with other model elements.
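Conclusion (iv) above can be illustrated with a small NumPy sketch that compares a plain softmax with Gumbel-Softmax samples at a temperature below one, using entropy as a simple proxy for how skewed the attention distribution is. The scores, the temperature of 0.5, the sample count, and the use of entropy as the skewness measure are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def entropy(p, axis=-1, eps=1e-12):
    # Shannon entropy in nats; lower means a more peaked distribution.
    return -(p * np.log(p + eps)).sum(axis=axis)

def gumbel_softmax(logits, tau, rng, n_samples):
    # Add i.i.d. Gumbel(0, 1) noise to the logits, then apply a
    # temperature-scaled softmax to each noisy copy.
    g = rng.gumbel(size=(n_samples,) + logits.shape)
    return softmax((logits + g) / tau, axis=-1)

rng = np.random.default_rng(0)
scores = np.array([1.0, 0.5, 0.2, 0.0])  # hypothetical attention scores

plain = softmax(scores)
sampled = gumbel_softmax(scores, tau=0.5, rng=rng, n_samples=2000)

entropy_plain = float(entropy(plain))
entropy_gumbel = float(entropy(sampled).mean())
print(entropy_plain, entropy_gumbel)
```

In this toy setting the average entropy of the Gumbel-Softmax samples is lower than that of the plain softmax, meaning each sampled attention distribution concentrates more mass on fewer positions.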
Systematic Generalization and Emergent Structures in Transformers Trained on Structured Tasks Yuxuan Li, James McClelland
We present a causal transformer that learns structured, algorithmic tasks and generalizes to longer sequences, and unpack its computation in relation to task structures by analyzing attention patterns and latent representations.
Transformer networks have seen great success in natural language processing and machine vision, where task objectives such as next word prediction and image classification benefit from nuanced context sensitivity across high-dimensional inputs. However, there is an ongoing debate about how and when transformers can acquire highly structured behavior and achieve systematic generalization. Here, we explore how well a causal transformer can perform a set of algorithmic tasks, including copying, sorting, and hierarchical compositions of these operations. We demonstrate strong generalization to sequences longer than those used in training by replacing the standard positional encoding typically used in transformers with labels arbitrarily paired with items in the sequence. We searched for the layer and head configuration sufficient to solve the task, and performed attention ablation and analyzed encoded representations. We show that two-layer transformers learn generalizable solutions to multi-level problems, develop signs of systematic task decomposition, and exploit shared computation across related tasks. These results provide key insights into the possible structures of within-task and cross-task computations that stacks of attention layers can afford.
TDLR: Top Semantic-Down Syntactic Language Representation Vipula Rawte, Megha Chakraborty, Kaushik Roy, Manas Gaur, Keyur Faldu, Prashant Kikani, Hemang Akbari, Amit P. Sheth
TDLR is a framework for infusing common-sense knowledge into language models in a top-down (semantic-to-syntactic) manner.
Language understanding involves processing text with both the grammatical and common-sense contexts of the text fragments. The text "I went to the grocery store and brought home a car" requires both the grammatical context (syntactic) and common-sense context (semantic) to capture the oddity in the sentence. Contextualized text representations learned by Language Models (LMs) are expected to capture a variety of syntactic and semantic contexts from large amounts of training data corpora. Recent work such as ERNIE has shown that infusing the knowledge contexts, where they are available in LMs, results in significant performance gains on General Language Understanding Evaluation (GLUE) benchmark tasks. However, to our knowledge, no knowledge-aware model has attempted to infuse knowledge through top-down semantics-driven syntactic processing (e.g., common-sense to grammatical) and directly operated on the attention mechanism that LMs leverage to learn the data context. We propose a learning framework, Top-Down Language Representation (TDLR), to infuse common-sense semantics into LMs. In our implementation, we build on BERT for its rich syntactic knowledge and use the knowledge graphs ConceptNet and WordNet to infuse semantic knowledge.
The Paradox of Choice: On the Role of Attention in Hierarchical Reinforcement Learning Andrei Cristian Nica, Khimya Khetarpal, Doina Precup
We characterize affordances as a hard-attention mechanism in hierarchical RL and investigate the role of hard versus soft attention in different scenarios, empirically demonstrating the "paradox of choice".
Decision-making AI agents are often faced with two important challenges: the depth of the planning horizon, and the branching factor due to having many choices. Hierarchical reinforcement learning methods aim to solve the first problem, by providing shortcuts that skip over multiple time steps. To cope with the breadth, it is desirable to restrict the agent's attention at each step to a reasonable number of possible choices. The concept of affordances (Gibson, 1977) suggests that only certain actions are feasible in certain states. In this work, we first characterize "affordances" as a "hard" attention mechanism that strictly limits the available choices of temporally extended options. We then investigate the role of hard versus soft attention in training data collection, abstract value learning in long-horizon tasks, and handling a growing number of choices. To this end, we present an online, model-free algorithm to learn affordances that can be used to further learn subgoal options. Finally, we identify and empirically demonstrate the settings in which the "paradox of choice" arises, i.e. when having fewer but more meaningful choices improves the learning speed and performance of a reinforcement learning agent.
Unlocking Slot Attention by Changing Optimal Transport Costs Yan Zhang, David W Zhang, Simon Lacoste-Julien, Gertjan J. Burghouts, Cees G. M. Snoek
Slot attention can do tiebreaking by changing the costs for optimal transport to minimize entropy, which improves results significantly on object detection.
Slot attention is a successful method for object-centric modeling with images and videos for tasks like unsupervised object discovery. However, set-equivariance limits its ability to perform tiebreaking, which makes distinguishing similar structures difficult – a task crucial for vision problems. To fix this, we cast cross-attention in slot attention as an optimal transport (OT) problem that has solutions with the desired tiebreaking properties. We then propose an entropy minimization module that combines the tiebreaking properties of unregularized OT with the speed of regularized OT. We evaluate our method on CLEVR object detection and observe significant improvements from 53% to 91% on a strict average precision metric.
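For readers unfamiliar with regularized optimal transport, the sketch below runs standard Sinkhorn iterations on a toy cost matrix and shows that weaker regularization yields a sharper, lower-entropy transport plan. It is not the entropy minimization module proposed in the paper; the cost matrix, uniform marginals, regularization strengths, and iteration count are all assumptions chosen for illustration.

```python
import numpy as np

def sinkhorn(cost, eps, n_iters=500):
    # Entropy-regularized OT with uniform marginals via Sinkhorn scaling.
    n, m = cost.shape
    a = np.full(n, 1.0 / n)   # row marginals (e.g., slots)
    b = np.full(m, 1.0 / m)   # column marginals (e.g., inputs)
    kernel = np.exp(-cost / eps)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (kernel @ v)      # rescale rows to match a
        v = b / (kernel.T @ u)    # rescale columns to match b
    return u[:, None] * kernel * v[None, :]

def plan_entropy(plan):
    # Shannon entropy of the transport plan, treated as a joint distribution.
    return float(-(plan * np.log(plan)).sum())

# Toy cost: matching index i to index j costs |i - j| / 3.
idx = np.arange(4)
cost = np.abs(idx[:, None] - idx[None, :]) / 3.0

plan_sharp = sinkhorn(cost, eps=0.1)   # weak regularization
plan_smooth = sinkhorn(cost, eps=1.0)  # strong regularization

entropy_sharp = plan_entropy(plan_sharp)
entropy_smooth = plan_entropy(plan_smooth)
print(entropy_sharp, entropy_smooth)
```

Strong regularization spreads mass across many pairings, which is fast to compute but blurs the assignment, while weak regularization approaches the near one-to-one plan of unregularized OT. This is the trade-off between speed and tiebreaking that the abstract refers to.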


Akanksha Saran, Microsoft Research
Abhijat Biswas, Carnegie Mellon University
Khimya Khetarpal, McGill University / MILA, Montreal
Reuben Aronson, Tufts University
Ruohan Zhang, Stanford University
Grace Lindsay, University College London / New York University
Scott Niekum, University of Texas, Austin / University of Massachusetts

Relevant Publications

CogSci and Neuroscience

Mary M. Hayhoe, Dana H. Ballard (2005). Eye Movements in Natural Behavior. Trends in Cognitive Sciences. Link

Grace W. Lindsay (2020). Attention in Psychology, Neuroscience, and Machine Learning. Front. Comput. Neurosci. Link

Machine Learning, Deep Learning and Reinforcement Learning

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin (2017). Attention is All You Need. NeurIPS. Link

Ruohan Zhang, Akanksha Saran, Bo Liu, Yifeng Zhu, Sihang Guo, Scott Niekum, Dana H. Ballard, Mary M. Hayhoe (2020). Human Gaze Assisted Artificial Intelligence – A Review. IJCAI. Link

Jürgen Schmidhuber (2020). Neural nets learn to program neural nets with fast weights—like today’s Transformer variants. AI Blog, Jürgen Schmidhuber. Link

Jürgen Schmidhuber (2021). End-to-End Differentiable Sequential Neural Attention. AI Blog, Jürgen Schmidhuber. Link

Akanksha Saran, Ruohan Zhang, Elaine S. Short, and Scott Niekum (2021). Efficiently Guiding Imitation Learning Agents with Human Gaze. AAMAS. Link

Haiping Wu, Khimya Khetarpal, and Doina Precup (2021). Self-supervised Attention-aware Reinforcement Learning. AAAI. Link

Alex Lamb, Riashat Islam, Yonathan Efroni, Aniket Didolkar, Dipendra Misra, Dylan Foster, Lekan Molu, Rajan Chari, Akshay Krishnamurthy, John Langford (2022). Guaranteed Discovery of Controllable Latent States with Multi-Step Inverse Models. arXiv. Link

Khimya Khetarpal, and Doina Precup (2018). Attend before you act – Leveraging human visual attention for continual learning. ICML Workshop on Lifelong Learning - A Reinforcement Learning Approach. Link

Stephanie Chan, Ishita Dasgupta, Junkyung Kim, Dharshan Kumaran, Andrew K. Lampinen, and Felix Hill (2022). Transformers generalize differently from information stored in context vs in weights. arXiv. Link

Michael Laskin, Luyu Wang, Junhyuk Oh, Emilio Parisotto, Stephen Spencer, Richie Steigerwald, DJ Strouse, Steven Hansen, Angelos Filos, Ethan Brooks, Maxime Gazeau, Himanshu Sahni, Satinder Singh, Volodymyr Mnih (2022). In-context Reinforcement Learning with Algorithm Distillation. arXiv. Link

Jason Ross Brown, Yiren Zhao, Ilia Shumailov, and Robert D. Mullins (2022). Wide Attention Is The Way Forward For Transformers. arXiv. Link

Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, Cyril Zhang (2022). Transformers Learn Shortcuts to Automata. arXiv. Link

HCI, HRI and Robotics

Reuben M Aronson, Thiago Santini, Thomas C Kübler, Enkelejda Kasneci, Siddhartha Srinivasa, Henny Admoni (2018). Eye-hand behavior in human-robot shared manipulation. HRI. Link

Akanksha Saran, Elaine Schaertl Short, Andrea Thomaz, and Scott Niekum (2019). Understanding Teacher Gaze Patterns for Robot Learning. PMLR. Link

Reuben M Aronson, Henny Admoni (2020). Eye Gaze for Assistive Manipulation. HRI. Link

Reuben M. Aronson, Nadia Almutlak, and Henny Admoni (2021). Inferring Goals with Gaze during Teleoperated Manipulation. IROS. Link

Reuben M Aronson, Henny Admoni (2022). Gaze Complements Control Input for Goal Prediction During Assisted Teleoperation. RSS. Link

PS: The above references are only representative of relevant work and are not meant to be exhaustive. To add a relevant paper to the references above, please feel free to open a pull request on the website's GitHub repository.