Computing a human-like reaction time metric from stable recurrent vision models

Lore Goetschalckx*
Brown University

lore_goetschalckx@brown.edu

Lakshmi N Govindarajan*
Massachusetts Institute of Technology (MIT)

lakshmin@mit.edu

Alekh K Ashok
Brown University

alekh_karkada_ashok@brown.edu

Aarit Ahuja
Brown University

aarit_ahuja@brown.edu

David Sheinberg
Brown University

david_sheinberg@brown.edu

Thomas Serre
Brown University

thomas_serre@brown.edu

Code [GitHub]
Dataset [GitHub]
NeurIPS 2023 (Spotlight) [Paper]


Figure 1: Computing a reaction time metric from a recurrent vision model. (a) A schematic representation of training a cRNN with evidential deep learning (EDL; Sensoy et al., 2018). Model outputs are interpreted as parameters (α) of a Dirichlet distribution over class probability estimates, with higher values (symbolized here by a darker gray) reflecting more generated evidence in favor of the corresponding class. In this framework, the width of the distribution signals the model's uncertainty (ε) about its predictions. (b) Visualization of our metric, ξcRNN, computed for an example stimulus (see Panel a) from the task studied in the paper. The metric is defined as the area under the uncertainty curve, i.e., under the evolution of uncertainty (ε) over time.

Abstract

The meteoric rise in the adoption of deep neural networks as computational models of vision has inspired efforts to "align" these models with humans. One dimension of interest for alignment includes behavioral choices, but moving beyond characterizing choice patterns to capturing temporal aspects of visual decision-making has been challenging. Here, we sketch a general-purpose methodology to construct computational accounts of reaction times from a stimulus-computable, task-optimized model. Specifically, we introduce a novel metric leveraging insights from subjective logic theory summarizing evidence accumulation in recurrent vision models. We demonstrate that our metric aligns with patterns of human reaction times for stimulus manipulations across four disparate visual decision-making tasks spanning perceptual grouping, mental simulation, and scene categorization. This work paves the way for exploring the temporal alignment of model and human visual strategies in the context of various other cognitive tasks toward generating testable hypotheses for neuroscience.

Contributions

(1) We introduce a novel computational framework to train, analyze, and interpret the behavior of cRNNs on visual cognitive tasks of choice. Our framework combines an attractor dynamics-based training routine with evidential learning theory to support stable and expressive evidence accumulation strategies.

(2) We derive a stimulus-computable, task- and model-agnostic metric to characterize evidence accumulation in cRNNs. Our metric requires no extra supervision and leverages purely the cRNN's internal activity dynamics (see the sketch after this list).

(3) We comprehensively demonstrate the efficacy of our metric in capturing stimulus-dependent primate decision time courses in the form of reaction times (RTs) in four disparate visual cognitive challenges that include serial grouping, mental simulation, and scene categorization. To the best of our knowledge, this is the first demonstration of qualitative temporal alignments between models and primates across task paradigms.
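Concretely, under the EDL framework in Figure 1, per-class evidence e_k at each time step gives Dirichlet parameters α_k = e_k + 1, the uncertainty is ε = K/S with S = Σ_k α_k (Sensoy et al., 2018), and ξcRNN is the area under the ε(t) curve. The following is a minimal sketch of the metric in this form, not the authors' released code; the evidence tensor shapes and the unit time step are illustrative assumptions.

```python
import torch

def edl_uncertainty(evidence: torch.Tensor) -> torch.Tensor:
    """evidence: (T, K) non-negative per-class evidence over T time steps.
    Returns the (T,) EDL uncertainty: alpha_k = e_k + 1, S = sum_k alpha_k,
    epsilon = K / S (Sensoy et al., 2018)."""
    alpha = evidence + 1.0            # Dirichlet parameters
    strength = alpha.sum(dim=-1)      # Dirichlet strength S per time step
    K = evidence.shape[-1]
    return K / strength               # in (0, 1]; high = uncertain

def xi_crnn(evidence: torch.Tensor, dt: float = 1.0) -> torch.Tensor:
    """xi_cRNN: area under the uncertainty curve (trapezoidal rule)."""
    return torch.trapz(edl_uncertainty(evidence), dx=dt)

# Toy example: evidence for both classes grows over time, so uncertainty decays.
T, K = 40, 2
evidence = torch.linspace(0.0, 5.0, T).unsqueeze(1).expand(T, K)
print(float(xi_crnn(evidence)))  # smaller xi = faster model "response"
```

In this toy example the evidence grows linearly, so ε decays quickly and the area is small; a stimulus that resolves more slowly keeps ε high for longer and thus yields a larger ξcRNN.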


Figure 2: Human versus cRNN temporal alignment on an incremental grouping task. (a) Description of the task (inspired by cognitive neuroscience studies). (b) Visualization of the cRNN dynamics. The two lines represent the average latent trajectories across 1K validation stimuli labeled "yes" and "no", respectively. Marker size indicates the average uncertainty ε across the same stimuli. The two trajectories start to diverge after some initial time has passed, clearly separating the two classes. Owing to the C-RBP training algorithm, and attesting to the dynamics approaching an equilibrium, the step sizes become increasingly small over time. We also include snapshots of the latent activity in the cRNN for two example stimuli (one "yes", one "no"; see Panel a) along their respective trajectories. Notice the spread of activity over time: the cRNN gradually fills the segment containing the dots. The strategy can be appreciated even better in the videos supplied in the SI. (c) Comparison against data from the experiment with human participants in Jeurissen et al. (2016). The position of one dot (white, labeled "Fix") was kept constant, while the position of the other was manipulated to create experimental conditions of varying difficulty. These manipulations have qualitatively similar effects on the cRNN as they do on human behavior. Error bars represent the standard error of the mean. The α levels for the contrasts shown (.1, .05, .01, and .001, indicated by increasing numbers of asterisks) were adjusted downward using Bonferroni correction. The spatial uncertainty map shown on the right visualizes the spatial anisotropy observed in the model for the same example stimulus shown on the left.
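For readers who want a Panel (b)-style plot from their own model, the sketch below is a hypothetical recipe, not the paper's plotting code: project the per-time-step latent states onto two principal components and scale the markers by the mean uncertainty ε at each step.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_latent_trajectories(h: np.ndarray, labels: np.ndarray, eps: np.ndarray):
    """h: (N, T, D) latent states for N stimuli over T steps; labels: (N,)
    0/1 task labels ("no"/"yes"); eps: (N, T) per-step uncertainty."""
    N, T, D = h.shape
    pca = PCA(n_components=2).fit(h.reshape(N * T, D))
    for cls, name in [(1, '"yes"'), (0, '"no"')]:
        mean_traj = h[labels == cls].mean(axis=0)        # (T, D) average path
        xy = pca.transform(mean_traj)                    # (T, 2) projection
        sizes = 200.0 * eps[labels == cls].mean(axis=0)  # marker ~ epsilon
        plt.plot(xy[:, 0], xy[:, 1], alpha=0.4)
        plt.scatter(xy[:, 0], xy[:, 1], s=sizes, label=name)
    plt.xlabel("PC 1"); plt.ylabel("PC 2"); plt.legend(); plt.show()
```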


Figure 3: ξcRNN captures the spatial anisotropy in model RTs. We present spatial uncertainty (SU) maps to visualize the impact a given fixation location will have on model RTs when the cue dot position is varied. Higher (lower) values of ξcRNN represent slower (faster) model responses.
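One way such an SU map could be assembled is sketched below, assuming xi_crnn from the metric sketch above plus two hypothetical helpers: render_stimulus (draws the outline stimulus with both dots) and run_crnn (returns the model's per-time-step evidence).

```python
import numpy as np

def spatial_uncertainty_map(map_shape, fix_xy, candidate_xys,
                            render_stimulus, run_crnn):
    """Hold the "Fix" dot constant, sweep the other dot over candidate
    locations, and record xi_cRNN at each location."""
    su = np.full(map_shape, np.nan)           # NaN = no valid dot placement
    for (x, y) in candidate_xys:              # candidate cue-dot positions
        stim = render_stimulus(fix_xy, (x, y))
        evidence = run_crnn(stim)             # (T, K) per-step evidence
        su[y, x] = float(xi_crnn(evidence))   # higher = slower model RT
    return su
```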


Figure 4: The cRNN learns a filling-in strategy to solve mazes. (a) Task description. (b) The cRNN takes more time to solve mazes with longer path lengths, an effect previously reported in human participants too. (c) Latent activity visualizations for two yes-mazes. The cRNN gradually fills the maze segment containing the squares. The strategy can be appreciated even better in the videos supplied in the SI. (d) Uncertainty curves for the two inputs shown in Panel c. The cRNN remains uncertain for much longer for the maze featuring the longer path. The uncertainty evolution is visualized dynamically, in conjunction with the change in latent activity, in the supplementary videos.
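The Panel (b) effect amounts to a positive rank correlation between ground-truth path length and ξcRNN across "yes" mazes. A hedged sketch of that analysis, where mazes and path_lengths are hypothetical arrays of maze stimuli and their ground-truth path lengths, and run_crnn / xi_crnn are the stand-ins from the sketches above:

```python
from scipy.stats import spearmanr

# `mazes`, `path_lengths`: hypothetical "yes"-maze stimuli and their
# ground-truth path lengths; run_crnn / xi_crnn as in the sketches above.
xis = [float(xi_crnn(run_crnn(maze))) for maze in mazes]
rho, p = spearmanr(path_lengths, xis)
print(f"Spearman rho = {rho:.2f}, p = {p:.3g}")  # expect rho > 0
```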

Additional materials

Incremental grouping

For the incremental grouping task, the cRNN learned a cognitively viable filling-in strategy, evident from inspecting the latent activity ht over time steps t (see GIFs below). This behavior is an emergent property of training purely under the classification task constraint, without any direct supervision for segmentation.

GIFs 1: Filling-in strategy learned by the cRNN in order to solve the incremental grouping task. The task was to tell whether the two dots are on the same object (yes/no). The left panel visualizes the input to the model. The middle panel shows the uncertainty curve, with a marker added at every time step to indicate the progression of time. The right panel shows a visualization of the latent activity over time.
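To produce comparable animations from another recurrent model, one minimal recipe is sketched below, assuming the latent activity is available as a (T, C, H, W) array; the channel-wise L2 norm is an illustrative way to summarize each frame, not necessarily the one used for these GIFs.

```python
import numpy as np
import imageio.v2 as imageio

def latent_activity_gif(h: np.ndarray, path: str = "latent.gif") -> None:
    """h: (T, C, H, W) latent activity; writes one grayscale frame per step."""
    frames = []
    for t in range(h.shape[0]):
        m = np.linalg.norm(h[t], axis=0)                # (H, W) activity map
        m = 255.0 * (m - m.min()) / (np.ptp(m) + 1e-8)  # rescale to 0..255
        frames.append(m.astype(np.uint8))
    imageio.mimsave(path, frames, duration=0.1)         # ~0.1 s per frame
```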

The tool below presents spatial uncertainty (SU) maps in an interactive format. Try clicking on a dot location!

Figure 3 (interactive): ξcRNN captures the spatial anisotropy in model RTs. We present spatial uncertainty (SU) maps to visualize the impact a given fixation location (colored white here) will have on model RTs when the position of the other dot is varied. Each dot configuration gives rise to its own uncertainty curve, which can be viewed by clicking on the respective cue dot location. Higher (lower) values of ξcRNN, the area under the uncertainty curve, represent slower (faster) model responses.

Mazes

The cRNN also learned a filling-in strategy to solve our maze task, which can be appreciated in the GIFs below.

GIFs 2: Filling-in strategy learned by the cRNN in order to solve the maze task. The task was to tell whether there is a path to connect the squares (yes/no). The left panel visualizes the input to the model. The middle panel shows the uncertainty curve, with a marker added at every time step to indicate the progression of time. The right panel shows a visualization of the latent activity over time.

Website adapted from https://people.csail.mit.edu/yuewang/projects/rfs/