All Things Attention: Bridging Different Perspectives on Attention
In conjunction with NeurIPS 2022: December 2, 2022. Hybrid in-person and virtual.
Attention is a widely popular topic studied in many fields such as neuroscience, psychology, and machine learning. A better understanding and conceptualization of attention in both humans and machines has led to significant progress across fields. At the same time, attention is far from a clear or unified concept, with many definitions within and across multiple fields.
Cognitive scientists study how the brain flexibly controls its limited computational resources to accomplish its objectives. Inspired by cognitive attention, machine learning researchers introduce attention as an inductive bias in their models to improve performance or interpretability. Human-computer interaction designers monitor people’s attention during interactions to implicitly detect aspects of their mental states.
While the aforementioned research areas all consider attention, each formalizes and operationalizes it in different ways. Bridging this gap will facilitate:
- (CogSci for AI) More principled forms of attention in AI agents, towards more human-like abilities such as robust generalization, quicker learning, and faster planning.
- (AI for CogSci) Developing better computational models of human behaviors that involve attention.
- (HCI) Modeling attention during interactions from implicit signals, for fluent and efficient coordination.
- (HCI/ML) Artificial models of algorithmic attention that enable intuitive interpretations of deep models.
Topics of Interest
The All Things Attention workshop aims to foster connections across disparate academic communities that conceptualize “attention”, such as Neuroscience, Psychology, Machine Learning, and Human-Computer Interaction. Workshop topics of interest include:
Relationships between biological and artificial attention
- What are the connections between forms of attention in the human brain and deep neural network architectures?
- Can the anatomy of human attention models inspire designs of architectures for artificial systems?
- Given the same task and learning objective, how do learned attention mechanisms in machines differ from those in humans?
Attention for reinforcement learning and decision making
- How have reinforcement learning agents leveraged attention in decision making?
- Do decision-making agents today have implicit or explicit formalisms of attention?
- How can AI agents develop notions of attention without having them explicitly baked in?
- Can attention significantly help AI agents scale, e.g., through gains in sample efficiency and generalization?
- How should learning systems reason about computational attention (which parts of sensed inputs to focus computation on)?
Attention mechanisms for continual / lifelong learning
- How can continual learning agents use attention to maintain already-learned knowledge?
- How can attention control the amount of interference between different inputs?
- How does the executive control of attention evolve with learning in humans?
- How can understanding the development of human attentional systems in infancy and childhood inform how attention might be learned in artificial systems?
Attention for interpretation and explanation
- How can attention models aid visualization?
- How is attention used for interpretability in AI?
- What are the major bottlenecks and common pitfalls in applying attention methods for explaining the decisions of AI agents?
Attention in human-computer interaction
- How do we detect aspects of human attention during interactions, from sensing to processing to representations?
- What systems benefit from human attention modeling, and how do they use these models?
- How can systems influence a user’s attention, and what systems benefit from this capability?
- How can a system communicate or simulate its own attention (humanlike or algorithmic) in an interaction, and to what benefit?
Attention mechanisms in Deep Neural Network (DNN) architectures
- How does attention in DNNs such as transformers relate to existing formalisms of attention in cogsci/psychology?
- How does self-attention in transformers contribute to their success in recent models such as GPT-2, GPT-3, and DALL-E? (See the sketch after this list.)
- How can an understanding of attention from other fields inspire future DNN research?
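To ground the algorithmic side of these questions, here is a minimal NumPy sketch of single-head scaled dot-product self-attention as used in transformers (Vaswani et al., 2017). The array shapes and projection matrices are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerically stable
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X          : (seq_len, d_model) token representations
    Wq, Wk, Wv : (d_model, d_k) projection matrices
    Returns the attended representations and the attention weights.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len) pairwise relevance
    weights = softmax(scores, axis=-1)        # each query's weights sum to 1
    return weights @ V, weights

# toy example: 5 tokens with 8-dimensional embeddings, projected to 4 dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
print(out.shape, attn.shape)   # (5, 4) (5, 5)
```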
Ways to Participate
A recording of the event is available here.
Questions
Ask questions on Slido: use this link or the page embedded on the workshop website.
RocketChat and NeurIPS workshop page
You can participate in live discussions during the workshop here. Invited speakers will also be encouraged to answer questions offline using the same link. Please note you need to be registered for the workshop to access RocketChat.
Virtual poster session and hangouts
The virtual poster session will be held on the Discord server — see the NeurIPS page for a link. It’ll be open for the whole day, so feel free to jump in just to chat!
Schedule
| Time (CST) | Event |
| --- | --- |
| 9:00 AM - 11:00 AM | Talks Session I |
| 9:00 AM - 9:05 AM | Workshop Intro |
| 9:05 AM - 9:25 AM | Ida Momennejad |
| 9:25 AM - 9:45 AM | James Whittington |
| 9:45 AM - 10:05 AM | Henny Admoni |
| 10:05 AM - 10:25 AM | Tobias Gerstenberg |
| 10:25 AM - 11:00 AM | Spotlight Talks: Foundations of Attention Mechanisms in Deep Neural Network Architectures; Is Attention Interpretation? A Quantitative Assessment On Sets |
| 11:00 AM - 12:00 PM | In-Person Panel Discussion. Panelists: Megan deBettencourt, Tobias Gerstenberg, Erin Grant, Ida Momennejad, Ramakrishna Vedantam, James Whittington, Cyril Zhang |
| 12:00 PM - 1:00 PM | Lunch and Virtual Social Event |
| 1:00 PM - 2:00 PM | Coffee Break / Poster Session |
| 2:00 PM - 4:00 PM | Talks Session II |
| 2:00 PM - 2:20 PM | Shalini De Mello |
| 2:20 PM - 2:40 PM | Pieter Roelfsema |
| 2:40 PM - 3:00 PM | Erin Grant |
| 3:00 PM - 3:20 PM | Vidhya Navalpakkam |
| 3:20 PM - 4:00 PM | Spotlight Talks: Wide Attention Is The Way Forward For Transformers; Fine-tuning hierarchical circuits through learned stochastic co-modulation; Hierarchical Abstraction for Combinatorial Generalization in Object Rearrangement |
| 4:00 PM - 5:00 PM | Coffee Break / Poster Session |
| 5:00 PM - 6:00 PM | Virtual Panel Discussion. Panelists: Henny Admoni, David Ha, Brian Kingsbury, John Langford, Shalini De Mello, Vidhya Navalpakkam, Ashish Vaswani |
Invited Speakers
Ida Momennejad
Attention in Task-sets, Planning, and the Prefrontal Cortex
What we pay attention to depends on the context and the task at hand. On the one hand, the prefrontal cortex can modulate how to direct attention outward to the external world. On the other hand, attention to internal states enables metacognition and configuration of internal states using repertoires of memories and skills. I will first discuss ongoing work in which, inspired by the role of attention in affordances and task-sets, we analyze large-scale game play data from the Xbox 3D game Bleeding Edge in an interpretable way. I will briefly mention ongoing directions, including decoding of plans during chess based on eye-tracking. I will conclude with how future models of multi-scale predictive representations could include prefrontal cortical modulation during planning and task performance.
James Whittington
Relating Transformers to Models and Neural Representations of the Hippocampal Formation
Many deep neural network architectures loosely based on brain networks have recently been shown to replicate neural firing patterns observed in the brain. One of the most exciting and promising novel architectures, the Transformer neural network, was developed without the brain in mind. In this work, we show that transformers, when equipped with recurrent position encodings, replicate the precisely tuned spatial representations of the hippocampal formation; most notably place and grid cells. Furthermore, we show that this result is no surprise since it is closely related to current hippocampal models from neuroscience. We additionally show the transformer version offers dramatic performance gains over the neuroscience version. This work continues to bind computations of artificial and brain networks, offers a novel understanding of the hippocampal-cortical interaction, and suggests how wider cortical areas may perform complex tasks beyond current neuroscience models such as language comprehension.
Henny Admoni
A. Nico Habermann Assistant Professor, Carnegie Mellon University
Eye Gaze in Human-Robot Collaboration
In robotics, human-robot collaboration works best when robots are responsive to their human partners’ mental states. Human eye gaze has been used as a proxy for one such mental state: attention. While eye gaze can be a useful signal, for example enabling intent prediction, it is also a noisy one. Gaze serves several functions beyond attention, and thus recognizing what people are attending to from their eye gaze is a complex task. In this talk, I will discuss our research on modeling eye gaze to understand human attention in collaborative tasks such as shared manipulation and assisted driving.
Tobias Gerstenberg
Attending to What's Not There
When people make sense of the world, they don’t only pay attention to what’s actually happening. Their mind also takes them to counterfactual worlds of what could have happened. In this talk, I will illustrate how we can use eye-tracking to uncover the human mind’s forays into the imaginary. I will show that when people make causal judgments about physical interactions, they don’t just look at what actually happens. They mentally simulate what would have happened in relevant counterfactual situations to assess whether the cause made a difference. And when people try to figure out what happened in the past, they mentally simulate the different scenarios that could have led to the outcome. Together these studies illustrate how attention is not only driven by what’s out there in the world, but also by what’s hidden inside the mind.
Shalini De Mello
Exploiting Human Interactions to Learn Human Attention
Unconstrained eye gaze estimation using ordinary webcams in smartphones and tablets is immensely useful for many applications. However, current eye gaze estimators are limited in their ability to generalize to a wide range of unconstrained conditions, including head poses, eye gaze angles, and lighting conditions. This is mainly due to the lack of gaze training data captured in in-the-wild conditions. Notably, eye gaze is a natural form of human communication while humans interact with each other. Visual data (videos or images) containing human interaction are also abundantly available on the internet and are constantly growing as people upload more. Could we leverage visual data containing human interaction to learn unconstrained gaze estimators? In this talk we will describe our foray into addressing this challenging problem. Our findings point to the great potential of human interaction data as a low-cost and ubiquitously available source of training data for unconstrained gaze estimators. By lessening the burden of specialized data collection and annotation, we hope to foster greater real-world adoption and proliferation of gaze estimation technology in end-user devices.
Pieter Roelfsema
BrainProp: How Attentional Processes in the Brain Solve the Credit Assignment Problem
Humans and many other animals have an enormous capacity to learn about sensory stimuli and to master new skills. Many of the mechanisms that enable us to learn remain to be understood. One of the greatest challenges of systems neuroscience is to explain how synaptic connections change to support maximally adaptive behaviour. We will provide an overview of factors that determine the change in the strength of synapses. Specifically, we will discuss the influence of attention, neuromodulators and feedback connections in synaptic plasticity and suggest a specific framework, called BrainProp, in which these factors interact to improve the functioning of the entire network.
Much recent work focuses on learning in the brain using presumed biologically plausible variants of supervised learning algorithms. However, the biological plausibility of these approaches is limited, because there is no teacher in the motor cortex that instructs the motor neurons. Instead, learning in the brain usually depends on reward and punishment. BrainProp is a biologically plausible reinforcement learning scheme for deep networks with any number of layers. The network chooses an action by selecting a unit in the output layer and uses feedback connections to assign credit to the units in lower layers that are responsible for this action. After the choice, the network receives reinforcement, so there is no need for a teacher. We showed that BrainProp is mathematically equivalent to error backpropagation, for one output unit at a time (Pozzi et al., 2020). We illustrate learning of classical and hard image-classification benchmarks (MNIST, CIFAR10, CIFAR100 and Tiny ImageNet) by deep networks. BrainProp achieves an accuracy that is equivalent to that of standard error backpropagation, and better than other state-of-the-art biologically inspired learning schemes. Additionally, the trial-and-error nature of learning incurs only a limited cost in additional training time, with BrainProp being a factor of 1-3.5 slower. These results provide new insights into how deep learning may be implemented in the brain.
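The following toy NumPy sketch illustrates the single-output-unit credit assignment idea described above. It is our reading of the abstract, not the authors' implementation (see Pozzi et al., 2020 for that); the network sizes and the toy task are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def brainprop_step(params, x, reward_fn, lr=0.1):
    """One learning step in the spirit of BrainProp: the network commits to a single
    output unit (an action), receives scalar reinforcement, and credit flows back
    only for that unit -- i.e. backpropagation restricted to one output at a time."""
    W1, W2 = params
    h = np.tanh(x @ W1)                  # hidden layer
    p = softmax(h @ W2)                  # action probabilities
    a = rng.choice(len(p), p=p)          # select (attend to) one output unit
    r = reward_fn(a)                     # scalar reward; no teacher vector
    delta_out = np.zeros_like(p)
    delta_out[a] = r - p[a]              # error is defined only for the chosen unit
    dW2 = np.outer(h, delta_out)
    delta_h = (W2 @ delta_out) * (1.0 - h ** 2)   # feedback assigns credit to hidden units
    dW1 = np.outer(x, delta_h)
    params[0] = W1 + lr * dW1
    params[1] = W2 + lr * dW2
    return a, r

# hypothetical toy task: reward 1 if the chosen action matches the largest of the first 4 inputs
params = [rng.normal(scale=0.1, size=(8, 16)), rng.normal(scale=0.1, size=(16, 4))]
for _ in range(2000):
    x = rng.normal(size=8)
    brainprop_step(params, x, lambda a: float(a == int(np.argmax(x[:4]))))
```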
Erin Grant
Senior Research Fellow, University College London
Attention as Interpretable Information Processing in Machine Learning Systems
Attention in psychology and neuroscience conceptualizes how the human mind prioritizes information as a result of limited resources. Machine learning systems do not necessarily share the same limits, but implementations of attention have nevertheless proven useful in machine learning across a broad set of domains. Why is this so? I will focus on one aspect: interpretability, which is an ongoing challenge for machine learning systems. I will discuss two different implementations of attention in machine learning that tie closely to conceptualizations of attention in two domains of psychological research. Using these case studies as a starting point, I will discuss the broader strengths and drawbacks of using attention to constrain and interpret how machine learning systems process information. I will end with a problem statement highlighting the need to move away from localized notions to a global view of how attention-like mechanisms modulate information processing in artificial systems.
Vidhya Navalpakkam
Accelerating Human Attention Research via ML Applied to Smartphones
Attention and eye movements are thought to be a window into the human mind, and have been extensively studied across Neuroscience, Psychology and HCI. However, progress in this area has been severely limited, as the underlying methodology relies on specialized hardware that is expensive (up to $30,000) and hard to scale. In this talk, I will present our recent work from Google, which shows that ML applied to smartphone selfie cameras can enable accurate gaze estimation, comparable to state-of-the-art hardware-based devices, at 1/100th the cost and without any additional hardware. Via extensive experiments, we show that our smartphone gaze technology can successfully replicate key findings from prior hardware-based eye movement research in Neuroscience and Psychology, across a variety of tasks including traditional oculomotor tasks, saliency analyses on natural images, and reading comprehension. We also show that smartphone gaze could enable applications in improved health/wellness, for example, as a potential digital biomarker for detecting mental fatigue. These results show that smartphone-based attention has the potential to unlock advances by scaling eye movement research, and enabling new applications for improved health, wellness and accessibility, such as gaze-based interaction for patients with ALS or stroke who cannot otherwise interact with devices.
Panelists
Henny Admoni
A. Nico Habermann Assistant Professor, Carnegie Mellon University
Erin Grant
Senior Research Fellow, University College London
Accepted Papers
Oral Presentations
Fine-tuning hierarchical circuits through learned stochastic co-modulation
Caroline Haimerl, Eero P Simoncelli, Cristina Savin
Targeted stochastic co-modulation in the brain introduces a label of task-relevant information that can help fine-tune a hierarchical model of the visual system for a new task
Attentional gating is a core mechanism supporting behavioral flexibility, but its biological implementation remains uncertain. Gain modulation of neural responses is likely to play a key role, but simply boosting relevant neural responses can be insufficient for improving behavioral outputs, especially in hierarchical circuits. Here we propose a variation of attentional gating that relies on stochastic gain modulation as a dedicated indicator of task relevance, which guides task-specific readout adaptation. We show that targeted stochastic modulation can be effectively learned and used to fine-tune hierarchical architectures, without reorganization of the underlying circuits. Simulations of such networks demonstrate improvements in learning efficiency and performance in novel tasks, relative to traditional attentional mechanisms based on deterministic gain increases. The effectiveness of this approach relies on the availability of representational bottlenecks in which the task relevant information is localized in small subpopulations of neurons. Overall, this work provides a new mechanism for constructing intelligent systems that can flexibly and robustly adapt to changes in task structure.
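A toy NumPy sketch of the core intuition, as we read it from the abstract (not the authors' model): a shared stochastic gain signal applied to a task-relevant subpopulation acts as a label that a downstream readout can recover by correlation. Population sizes and thresholds below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
n_neurons, n_trials = 50, 2000
relevant = np.zeros(n_neurons, dtype=bool)
relevant[:10] = True                                        # task-relevant subpopulation

base = rng.normal(loc=1.0, scale=0.2, size=(n_trials, n_neurons))
modulator = 1.0 + 0.3 * rng.normal(size=(n_trials, 1))      # shared stochastic gain signal
responses = np.where(relevant, base * modulator, base)      # only relevant neurons are co-modulated

# A readout with access to the modulator can recover the "labeled" neurons by correlation
# and then confine its weights to them when adapting to a new task.
corr = np.array([np.corrcoef(responses[:, i], modulator[:, 0])[0, 1] for i in range(n_neurons)])
recovered = corr > 0.2
print(np.array_equal(recovered, relevant))                  # typically True in this toy setting
```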
Foundations of Attention Mechanisms in Deep Neural Network Architectures
Pierre Baldi, Roman Vershynin
We classify all attention mechanisms, identify the most important one, and prove several theorems about their capacity.
We consider the foundations of attention mechanisms in deep neural network architectures and present three main results. First, we provide a systematic taxonomy of all possible attention mechanisms within, or as extensions of, the McCulloch and Pitts standard model into 18 classes depending on the origin type of the attention signal, the target type of the attention signal, and whether the interaction type is additive or multiplicative. Second, using this taxonomy, we identify three key attention mechanisms: output gating, synaptic gating, and multiplexing. Output gating and synaptic gating are extensions of the standard model and all current attention-based architectures, including transformers, use either output gating or synaptic gating, or a combination of both. Third, we develop a theory of attention capacity and derive mathematical results about the capacity of basic attention networks. For example, the output gating of a linear threshold gate of $n$ variables by another linear threshold gate of the same $n$ variables has capacity $2n^2 (1+o(1))$. Perhaps surprisingly, multiplexing attention is used in the proofs of these results. Synaptic and output gating provide computationally efficient extensions of the standard model allowing for sparse quadratic activation functions. They can also be viewed as primitives enabling the concise collapsing of multiple layers of processing in the standard model.
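The toy sketch below illustrates one reading of the two gating mechanisms named in the abstract, applied to linear threshold gates; it is illustrative only and does not reproduce the paper's formal definitions or capacity results.

```python
import numpy as np

def lt_gate(w, b, x):
    """Linear threshold gate: 1 if w.x + b > 0, else 0."""
    return float(np.dot(w, x) + b > 0)

def output_gating(w, b, x, attn):
    # the attention signal multiplies the *output* of the gated unit
    return attn * lt_gate(w, b, x)

def synaptic_gating(w, b, x, attn):
    # the attention signal multiplies the *synapses* (weights) of the gated unit
    return lt_gate(w * attn, b, x)

x = np.array([1.0, -0.5, 0.25])
w, b = np.array([0.8, 0.3, -1.0]), 0.1
attn = lt_gate(np.ones(3), -1.0, x)     # attention signal from another gate over the same inputs
print(output_gating(w, b, x, attn))     # 0.0: a zero attention signal silences the output
print(synaptic_gating(w, b, x, attn))   # 1.0: only the synapses are gated; the bias still fires
```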
Hierarchical Abstraction for Combinatorial Generalization in Object Rearrangement
Michael Chang, Alyssa Li Dayan, Franziska Meier, Thomas L. Griffiths, Sergey Levine, Amy Zhang
We demonstrate how to generalize over a combinatorially large space of rearrangement tasks from only pixel observations by constructing from video demonstrations a factorized transition graph over entity state transitions that we use for control.
Object rearrangement is a challenge for embodied agents because solving these tasks requires generalizing across a combinatorially large set of underlying entities that take the value of object states. Worse, these entities are often unknown and must be inferred from sensory percepts. We present a hierarchical abstraction approach to uncover these underlying entities and achieve combinatorial generalization from unstructured inputs. By constructing a factorized transition graph over clusters of object representations inferred from pixels, we show how to learn a correspondence between intervening on states of entities in the agent's model and acting on objects in the environment. We use this correspondence to develop a method for control that generalizes to different numbers and configurations of objects, which outperforms current offline deep RL methods when evaluated on a set of simulated rearrangement and stacking tasks.
Is Attention Interpretation? A Quantitative Assessment On Sets
Jonathan Haab, Nicolas Deutschmann, Maria Rodriguez Martinez
We test the interpretability of attention weights by designing Multiple Instance Learning synthetic datasets with ground-truth instance-level labels.
The debate around the interpretability of attention mechanisms is centered on whether attention scores can be used as a proxy for the relative amounts of signal carried by sub-components of data. We propose to study the interpretability of attention in the context of set machine learning, where each data point is composed of an unordered collection of instances with a global label. For classical multiple-instance-learning problems and simple extensions, there is a well-defined “importance” ground truth that can be leveraged to cast interpretation as a binary classification problem, which we can quantitatively evaluate. By building synthetic datasets over several data modalities, we perform a systematic assessment of attention-based interpretations. We find that attention distributions are indeed often reflective of the relative importance of individual instances, but that silent failures happen where a model will have high classification performance but attention patterns that do not align with expectations. Based on these observations, we propose to use ensembling to minimize the risk of misleading attention-based explanations.
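A small sketch of the quantitative evaluation idea: treating a bag's attention weights as scores for a binary instance-relevance classifier and measuring how well they rank ground-truth relevant instances. The toy bag and the AUROC estimator below are illustrative, not the paper's datasets or metrics.

```python
import numpy as np

def attention_auroc(attn_weights, instance_labels):
    """Score a bag's attention weights as a classifier of instance relevance.

    attn_weights    : (n_instances,) attention over the instances of one bag
    instance_labels : (n_instances,) ground-truth 0/1 relevance labels
    Returns the AUROC of ranking relevant instances above irrelevant ones.
    """
    pos = attn_weights[instance_labels == 1]
    neg = attn_weights[instance_labels == 0]
    # probability that a random relevant instance receives more attention than an irrelevant one
    wins = pos[:, None] > neg[None, :]
    ties = pos[:, None] == neg[None, :]
    return float((wins + 0.5 * ties).mean())

# toy bag: 8 instances, two of which carry the bag-level signal
labels = np.array([0, 0, 1, 0, 0, 1, 0, 0])
attn = np.array([0.02, 0.05, 0.40, 0.03, 0.04, 0.35, 0.06, 0.05])
print(attention_auroc(attn, labels))   # 1.0 here: attention ranks the relevant instances on top
```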
Wide Attention Is The Way Forward For Transformers?
Jason Ross Brown, Yiren Zhao, Ilia Shumailov, Robert D. Mullins
Widening the attention layer in a Transformer and only using a single layer is surprisingly effective, with a number of advantages.
Poster Presentations
Attention as inference with third-order interactions
Yicheng Fei, Xaq Pitkow
In neuroscience, attention has been associated operationally with enhanced processing of certain sensory inputs depending on external or internal contexts such as cueing, salience, or mental states. In machine learning, attention usually means a multiplicative mechanism whereby the weights in a weighted summation of an input vector are calculated from the input itself or some other context vector. In both scenarios, attention can be conceptualized as a gating mechanism. In this paper, we argue that three-way interactions serve as a normative way to define a gating mechanism in generative probabilistic graphical models. By going a step beyond pairwise interactions, it empowers much more computational efficiency, like a transistor expands possible digital computations. Models with three-way interactions are also easier to scale up and thus to implement biologically. As an example application, we show that a graphical model with three-way interactions provides a normative explanation for divisive normalization in macaque primary visual cortex, an operation adopted widely throughout the cortex to reduce redundancy, save energy, and improve computation.
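As a concrete illustration of the three-way (multiplicative) interactions discussed above, the snippet below shows a context vector gating an input-output mapping through a third-order weight tensor; the tensor and vectors are random placeholders, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(4)
W = rng.normal(size=(5, 8, 3))     # third-order interaction tensor: (output, input, context)
x = rng.normal(size=8)             # input (e.g., sensory) vector
c = np.array([1.0, 0.0, 0.0])      # context / attentional state vector

# The context gates the computation multiplicatively: it selects an effective
# weight matrix W_eff[i, j] = sum_k W[i, j, k] * c[k], so changing c reshapes
# the input-output mapping itself rather than just adding to it.
W_eff = np.einsum("ijk,k->ij", W, c)
y = W_eff @ x
print(y.shape)   # (5,)
```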
Attention for Compositional Modularity
Oleksiy Ostapenko, Pau Rodriguez, Alexandre Lacoste, Laurent Charlin
In this work we studied different attention-based module selection approaches for compositional modularity.
Modularity and compositionality are promising inductive biases for addressing longstanding problems in machine learning such as better systematic generalization, as well as better transfer and lower forgetting in the context of continual learning. Here we study how attention-based module selection can help achieve compositional modularity, i.e., decomposition of tasks into meaningful sub-tasks which are tackled by independent architectural entities that we call modules. These sub-tasks must be reusable and the system should be able to learn them without additional supervision. We design a simple experimental setup in which the model is trained to solve mathematical equations with multiple math operations applied sequentially. We study different attention-based module selection strategies, inspired by the principles introduced in the recent literature. We evaluate the method's ability to learn modules that can recover the underlying sub-tasks (operations) used for data generation, as well as the ability to generalize compositionally. We find that meaningful module selection (i.e. routing) is the key to compositional generalization. Further, without access to privileged information about which part of the input should be used for module selection, the routing component performs poorly for samples that are compositionally out of the training distribution. We find that the main reason for this lies in the routing component, since many of the tested methods perform well OOD if we report the performance of the best-performing path at test time. Additionally, we study the role of the number of primitives, the number of training points, and bottlenecks for modular specialization.
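A minimal sketch of attention-based module selection in the spirit described above: a router attends over a set of candidate modules and mixes their outputs. The modules and router here are hypothetical stand-ins, not the paper's architecture.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# hypothetical modules, each meant to specialize in one arithmetic sub-task
modules = [lambda v: v + 1.0, lambda v: v * 2.0, lambda v: -v]

def route(x, router_w):
    """Soft attention over modules: the router scores each module from the input
    and the output is the attention-weighted mixture of module outputs."""
    scores = router_w @ np.atleast_1d(x)          # one score per module
    attn = softmax(scores)
    outputs = np.array([m(x) for m in modules])   # run every module (soft selection)
    return attn @ outputs, attn

rng = np.random.default_rng(2)
router_w = rng.normal(size=(len(modules), 1))     # untrained router, for shape illustration only
y, attn = route(np.array([3.0]), router_w)
print(y, attn)
```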
Bounded logit attention: Learning to explain image classifiers
Thomas Baumhauer, Djordje Slijepcevic, Matthias Zeppelzauer
We present a trainable self-explanation module for convolutional neural networks based on an attention mechanism using a novel type of activation function.
Explainable artificial intelligence is the attempt to elucidate the workings of systems too complex to be directly accessible to human cognition through suitable side-information referred to as “explanations”. We present a trainable explanation module for convolutional image classifiers that we call bounded logit attention (BLA). The BLA module learns to select a subset of the convolutional feature map for each input instance, which then serves as an explanation for the classifier’s prediction. BLA overcomes several limitations of the instance-wise feature selection method “learning to explain” (L2X) introduced by Chen et al. (2018): 1) BLA scales to real-world-sized image classification problems, and 2) BLA offers a canonical way to learn explanations of variable size. Due to its modularity, BLA lends itself to transfer learning setups and can also be employed as a post-hoc add-on to trained classifiers. Beyond explainability, BLA may serve as a general-purpose method for differentiable approximation of subset selection. In a user study we find that BLA explanations are preferred over explanations generated by the popular (Grad-)CAM method (Zhou et al., 2016; Selvaraju et al., 2017).
Faster Attention Is What You Need: A Fast Self-Attention Neural Network Backbone Architecture for the Edge via Double-Condensing Attention Condensers
Alexander Wong, Mohammad Javad Shafiee, Saad Abbasi, Saeejith Nair, Mahmoud Famouri
With the growing adoption of deep learning for on-device TinyML applications, there has been an ever-increasing demand for more efficient neural network backbones optimized for the edge. Recently, the introduction of attention condenser networks has resulted in low-footprint, highly efficient, self-attention neural networks that strike a strong balance between accuracy and speed. In this study, we introduce a new faster attention condenser design called double-condensing attention condensers that enable more condensed feature embedding. We further employ a machine-driven design exploration strategy that imposes best-practices design constraints for greater efficiency and robustness to produce the macro-micro architecture constructs of the backbone. The resulting backbone (which we name AttendNeXt) achieves significantly higher inference throughput on an embedded ARM processor when compared to several other state-of-the-art efficient backbones (>10× faster than FB-Net C at higher accuracy and speed and >10× faster than MobileOne-S1 at smaller size) while having a small model size (>1.37× smaller than MobileNetv3-L at higher accuracy and speed) and strong accuracy (1.1% higher top-1 accuracy than MobileViT XS on ImageNet at higher speed). These promising results demonstrate that exploring different efficient architecture designs and self-attention mechanisms can lead to interesting new building blocks for TinyML applications.
First De-Trend then Attend: Rethinking Attention for Time-Series Forecasting
Xiyuan Zhang, Xiaoyong Jin, Karthick Gopalswamy, Gaurav Gupta, Youngsuk Park, Xingjian Shi, Hao Wang, Danielle C. Maddix, Bernie Wang
We theoretically and empirically analyze relationships between variants of attention models in time-series forecasting, and propose a decomposition-based hybrid method that achieves better performance than current attention models.
Transformer-based models have gained large popularity and demonstrated promising results in long-term time-series forecasting in recent years. In addition to learning attention in time domain, recent works also explore learning attention in frequency domains (e.g., Fourier domain, wavelet domain), given that seasonal patterns can be better captured in these domains. In this work, we seek to understand the relationships between attention models in different time and frequency domains. Theoretically, we show that attention models in different domains are equivalent under linear conditions (i.e., linear kernel to attention scores). Empirically, we analyze how attention models of different domains show different behaviors through various synthetic experiments with seasonality, trend and noise, with emphasis on the role of softmax operation therein. Both these theoretical and empirical analyses motivate us to propose a new method: TDformer (Trend Decomposition Transformer), that first applies seasonal-trend decomposition, and then additively combines an MLP which predicts the trend component with Fourier attention which predicts the seasonal component to obtain the final prediction. Extensive experiments on benchmark time-series forecasting datasets demonstrate that TDformer achieves state-of-the-art performance against existing attention-based models.
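The decomposition step at the heart of the proposed method can be illustrated with a simple moving-average split of a series into trend and seasonal components; the sketch below shows only that step (the MLP trend head and Fourier-attention seasonal head are summarized in comments), using a synthetic series as input.

```python
import numpy as np

def seasonal_trend_decompose(x, window=25):
    """Moving-average decomposition: trend = smoothed series, seasonal = residual."""
    pad = window // 2
    padded = np.pad(x, (pad, pad), mode="edge")
    kernel = np.ones(window) / window
    trend = np.convolve(padded, kernel, mode="valid")[: len(x)]
    return trend, x - trend

# synthetic series: linear trend + daily seasonality + noise
rng = np.random.default_rng(3)
t = np.arange(200)
x = 0.05 * t + np.sin(2 * np.pi * t / 24) + 0.1 * rng.normal(size=t.size)
trend, seasonal = seasonal_trend_decompose(x)

# In a TDformer-style model, a trend head (e.g., an MLP) would forecast `trend`,
# a frequency-domain attention head would forecast `seasonal`, and the final
# prediction would be their sum. Here we only check the decomposition itself:
print(np.allclose(trend + seasonal, x))   # True by construction
```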
FuzzyNet: A Fuzzy Attention Module for Polyp Segmentation
Krushi Bharatbhai Patel, Fengjun Li, Guanghui Wang
A Fuzzy attention module to focus more on hard pixels lying around the boundary region of the polyp.
Polyp segmentation is essential for accelerating the diagnosis of colon cancer. However, it is challenging because of the diverse color, texture, and varying lighting effects of the polyps as well as the subtle difference between the polyp and its surrounding area. To further increase the performance of polyp segmentation, we propose to focus more on the problematic pixels that are harder to predict. To this end, we propose a novel attention module named Fuzzy Attention to focus more on the difficult pixels. Our attention module generates a high attention score for fuzzy pixels usually located near the boundary region. This module can be embedded in any convolution neural network-based backbone network. We embed our module with various backbone networks: Res2Net, ConvNext and Pyramid Vision Transformer and evaluate the models on five polyp segmentation datasets: Kvasir, CVC-300, CVC-ColonDB, CVC-ClinicDB, and ETIS. Our attention module with Res2Net as the backbone network outperforms the reverse attention-based PraNet by a significant amount on all datasets. In addition, our module with PVT as the backbone network achieves state-of-the-art accuracy of 0.937, 0.811, and 0.791 on the CVC-ClinicDB, CVC-ColonDB, and ETIS, respectively, outperforming the latest SA-Net, TransFuse and Polyp-PVT.
Graph Attention for Spatial Prediction
Corban Rivera, Ryan W. Gardner
We introduced an allocentric graph attention approach for spatial reasoning and object localization
Imbuing robots with human-levels of intelligence is a longstanding goal of AI research. A critical aspect of human-level intelligence is spatial reasoning. Spatial reasoning requires a robot to reason about relationships among objects in an environment to estimate the positions of unseen objects. In this work, we introduced a novel graph attention approach for predicting the locations of query objects in partially observable environments. We found that our approach achieved state of the art results on object location prediction tasks. Then, we evaluated our approach on never before seen objects, and we observed zero-shot generalization to estimate the positions of new object types.
Improving cross-modal attention via object detection
Yongil Kim, Yerin Hwang, Seunghyun Yoon, Hyeongu Yun, Kyomin Jung
Cross-modal attention is widely used in multimodal learning to fuse information from two modalities. However, most existing models only learn cross-modal attention indirectly through end-to-end training and do not directly improve the attention mechanism itself. In this paper, we propose a methodology for directly enhancing cross-modal attention by utilizing object-detection models for vision-and-language tasks that deal with image and text information. We used the mask of the detected objects obtained by the detection model as a pseudo label, and we added a loss between the attention map of the multimodal learning model and the pseudo label. The proposed methodology drastically improves the performance of the baseline model across all performance metrics on various popular datasets for the image-captioning task. Moreover, our highly scalable methodology can be applied to any vision-and-language multimodal task.
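A rough sketch of the kind of auxiliary supervision described above: a detection-derived mask is normalized into a pseudo-label distribution and compared against the model's cross-modal attention map. The cross-entropy form and shapes below are assumptions for illustration, not the paper's exact loss.

```python
import numpy as np

def attention_supervision_loss(attn_map, object_mask, eps=1e-8):
    """Cross-entropy between a detection-derived pseudo label and an attention map.

    attn_map    : (H, W) cross-modal attention over image regions, sums to 1
    object_mask : (H, W) binary mask of detected objects (the pseudo label)
    """
    target = object_mask / (object_mask.sum() + eps)       # mask as a distribution
    return float(-(target * np.log(attn_map + eps)).sum())

attn = np.full((4, 4), 1 / 16)                  # a uniform (uninformative) attention map
mask = np.zeros((4, 4)); mask[1:3, 1:3] = 1.0   # a detected object occupying the center
print(attention_supervision_loss(attn, mask))   # higher than for attention focused on the object
```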
Quantifying attention via dwell time and engagement in a social media browsing environment
Ziv Epstein, Hause Lin, Gordon Pennycook, David Rand
We propose a two-stage model of attention for social media environments that disentangles engagement and dwell, and show that attention operates differently in these two stages via a dissociation.
Revisiting Attention Weights as Explanations from an Information Theoretic Perspective
Bingyang Wen, Koduvayur Subbalakshmi, Fan Yang
This work evaluates the interpretability of attention weights and the results show that attention weights have the potential to be used as model's explanation.
Systematic Generalization and Emergent Structures in Transformers Trained on Structured Tasks
Yuxuan Li, James McClelland
We present a causal transformer that learns structured, algorithmic tasks and generalizes to longer sequences, and unpack its computation in relation to task structures by analyzing attention patterns and latent representations.
TDLR: Top Semantic-Down Syntactic Language Representation
Vipula Rawte, Megha Chakraborty, Kaushik Roy, Manas Gaur, Keyur Faldu, Prashant Kikani, Hemang Akbari, Amit P. Sheth
TDLR framework to infuse knowledge (common-sense) in language models in a top-down (semantic-syntactic) manner.
Language understanding involves processing text with both the grammatical and common-sense contexts of the text fragments. The text "I went to the grocery store and brought home a car" requires both the grammatical context (syntactic) and common-sense context (semantic) to capture the oddity in the sentence. Contextualized text representations learned by Language Models (LMs) are expected to capture a variety of syntactic and semantic contexts from large amounts of training data corpora. Recent work such as ERNIE has shown that infusing the knowledge contexts, where they are available, into LMs results in significant performance gains on General Language Understanding (GLUE) benchmark tasks. However, to our knowledge, no knowledge-aware model has attempted to infuse knowledge through top-down, semantics-driven syntactic processing (e.g., from common-sense to grammatical) and directly operated on the attention mechanism that LMs leverage to learn the data context. We propose a learning framework, Top-Down Language Representation (TDLR), to infuse common-sense semantics into LMs. In our implementation, we build on BERT for its rich syntactic knowledge and use the knowledge graphs ConceptNet and WordNet to infuse semantic knowledge.
The Paradox of Choice: On the Role of Attention in Hierarchical Reinforcement Learning
Andrei Cristian Nica, Khimya Khetarpal, Doina Precup
We characterize affordances as a hard-attention mechanism in hierarchical RL and investigate the role of hard versus soft attention in different scenarios, empirically demonstrating the "paradox of choice".
Decision-making AI agents are often faced with two important challenges: the depth of the planning horizon, and the branching factor due to having many choices. Hierarchical reinforcement learning methods aim to solve the first problem, by providing shortcuts that skip over multiple time steps. To cope with the breadth, it is desirable to restrict the agent's attention at each step to a reasonable number of possible choices. The concept of affordances (Gibson, 1977) suggests that only certain actions are feasible in certain states. In this work, we first characterize "affordances" as a "hard" attention mechanism that strictly limits the available choices of temporally extended options. We then investigate the role of hard versus soft attention in training data collection, abstract value learning in long-horizon tasks, and handling a growing number of choices. To this end, we present an online, model-free algorithm to learn affordances that can be used to further learn subgoal options. Finally, we identify and empirically demonstrate the settings in which the "paradox of choice" arises, i.e. when having fewer but more meaningful choices improves the learning speed and performance of a reinforcement learning agent.
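A minimal sketch of affordances as hard attention over options, as characterized above: options that are not afforded in the current state are masked out before the choice. The scores and mask below are illustrative only.

```python
import numpy as np

def select_option(option_scores, affordance_mask):
    """Affordances as hard attention: options that are not afforded in the current
    state are removed from consideration before the (soft) choice is made."""
    masked = np.where(affordance_mask, option_scores, -np.inf)
    probs = np.exp(masked - masked.max())
    probs = probs / probs.sum()
    return int(np.argmax(probs)), probs

scores = np.array([0.2, 1.5, 0.9, 1.4])            # preferences over temporally extended options
afforded = np.array([True, False, True, True])     # hard attention: option 1 is not afforded here
choice, probs = select_option(scores, afforded)
print(choice, probs.round(2))                      # option 3 is chosen; option 1 gets zero mass
```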
Unlocking Slot Attention by Changing Optimal Transport Costs
Yan Zhang, David W Zhang, Simon Lacoste-Julien, Gertjan J. Burghouts, Cees G. M. Snoek
Slot attention can do tiebreaking by changing the costs for optimal transport to minimize entropy, which improves results significantly on object detection
Slot attention is a successful method for object-centric modeling with images and videos for tasks like unsupervised object discovery. However, set-equivariance limits its ability to perform tiebreaking, which makes distinguishing similar structures difficult – a task crucial for vision problems. To fix this, we cast cross-attention in slot attention as an optimal transport (OT) problem that has solutions with the desired tiebreaking properties. We then propose an entropy minimization module that combines the tiebreaking properties of unregularized OT with the speed of regularized OT. We evaluate our method on CLEVR object detection and observe significant improvements from 53% to 91% on a strict average precision metric.
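A small NumPy sketch of the idea of treating cross-attention as a transport plan: Sinkhorn-style row/column normalization constrains how much attention each input can receive, which is what allows near-identical inputs to be divided between slots. This is an entropy-regularized simplification; the paper's entropy-minimization module is only noted in a comment.

```python
import numpy as np

def sinkhorn_attention(sim, temperature=0.1, n_iters=50):
    """Cross-attention normalized as an entropy-regularized transport plan (a sketch).

    sim : (n_slots, n_inputs) slot-input similarity scores.
    Alternating row/column normalization also constrains how much total attention
    each *input* receives, unlike an ordinary per-slot softmax.
    """
    plan = np.exp(sim / temperature)
    for _ in range(n_iters):
        plan /= plan.sum(axis=1, keepdims=True)   # each slot distributes one unit of attention
        plan /= plan.sum(axis=0, keepdims=True)   # each input is claimed roughly once
    return plan

# Both slots prefer input 0, so a per-slot softmax would let them both collapse onto it;
# the transport constraints instead share the two inputs across the two slots.
sim = np.array([[1.0, 0.2],
                [0.9, 0.2]])
print(sinkhorn_attention(sim).round(2))
# The paper additionally minimizes the entropy of this plan to approach the hard,
# tiebreaking solution of unregularized OT; that module is not sketched here.
```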
Organizers
Scott Niekum
University of Texas, Austin / University of Massachusetts
Relevant Publications
CogSci and Neuroscience
Mary M. Hayhoe, Dana H. Ballard (2005). Eye Movements in Natural Behavior. Trends in Cognitive Sciences. Link
Grace W. Lindsay (2020). Attention in Psychology, Neuroscience, and Machine Learning. Front. Comput. Neurosci. Link
Machine Learning, Deep Learning and Reinforcement Learning
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin (2017). Attention is All You Need. NeurIPS. Link
Ruohan Zhang, Akanksha Saran, Bo Liu, Yifeng Zhu, Sihang Guo, Scott Niekum, Dana H. Ballard, Mary M. Hayhoe (2020). Human Gaze Assisted Artificial Intelligence – A Review. IJCAI. Link
Jürgen Schmidhuber (2020). Neural nets learn to program neural nets with fast weights—like today’s Transformer variants. AI Blog, Jürgen Schmidhuber. Link
Jürgen Schmidhuber (2021). End-to-End Differentiable Sequential Neural Attention. AI Blog, Jürgen Schmidhuber. Link
Akanksha Saran, Ruohan Zhang, Elaine S. Short, and Scott Niekum (2021). Efficiently Guiding Imitation Learning Agents with Human Gaze. AAMAS. Link
Haiping Wu, Khimya Khetarpal, and Doina Precup (2021). Self-supervised Attention-aware Reinforcement Learning. AAAI. Link
Alex Lamb, Riashat Islam, Yonathan Efroni, Aniket Didolkar, Dipendra Misra, Dylan Foster, Lekan Molu, Rajan Chari, Akshay Krishnamurthy, John Langford (2022). Guaranteed Discovery of Controllable Latent States with Multi-Step Inverse Models. arXiv. Link
Khimya Khetarpal, and Doina Precup (2018). Attend before you act – Leveraging human visual attention for continual learning. ICML Workshop on Lifelong Learning - A Reinforcement Learning Approach. Link
Stephanie Chan, Ishita Dasgupta, Junkyung Kim, Dharshan Kumaran, Andrew K. Lampinen, and Felix Hill (2022). Transformers generalize differently from information stored in context vs in weights. arXiv. Link
Michael Laskin, Luyu Wang, Junhyuk Oh, Emilio Parisotto, Stephen Spencer, Richie Steigerwald, DJ Strouse, Steven Hansen, Angelos Filos, Ethan Brooks, Maxime Gazeau, Himanshu Sahni, Satinder Singh, Volodymyr Mnih (2022). In-context Reinforcement Learning with Algorithm Distillation. arXiv. Link
Jason Ross Brown, Yiren Zhao, Ilia Shumailov, and Robert D. Mullins. (2022). Wide Attention Is The Way Forward For Transformers. arXiv. Link
Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, Cyril Zhang (2022). Transformers Learn Shortcuts to Automata. arXiv. Link
HCI, HRI and Robotics
Reuben M Aronson, Thiago Santini, Thomas C Kübler, Enkelejda Kasneci, Siddhartha Srinivasa, Henny Admoni (2018). Eye-hand behavior in human-robot shared manipulation. HRI. Link
Akanksha Saran, Elaine Schaertl Short, Andrea Thomaz, and Scott Niekum (2019). Understanding Teacher Gaze Patterns for Robot Learning. PMLR. Link
Reuben M Aronson, Henny Admoni (2020). Eye Gaze for Assistive Manipulation. HRI. Link
Reuben M. Aronson, Nadia Almutlak, and Henny Admoni (2021). Inferring Goals with Gaze during Teleoperated Manipulation. IROS. Link
Reuben M Aronson, Henny Admoni (2022). Gaze Complements Control Input for Goal Prediction During Assisted Teleoperation. RSS. Link
PS: The above references are only representative of relevant work and are not meant to be exhaustive. To add a relevant paper to the list, please feel free to open a pull request against the website's GitHub repository.