Stable and expressive recurrent vision models

Drew Linsley*, Alekh K Ashok*, Lakshmi N Govindarajan*, Rex Liu, Thomas Serre

Brown University

{drew_linsley, alekh_karkada_ashok, lakshmi_govindarajan, rex_liu, thomas_serre}@brown.edu

Code [GitHub] NeurIPS 2020 (Spotlight) [Paper]

Figure 1: Recurrent CNNs trained with backpropagation through time (BPTT) have unstable dynamics and forget task information. This pathology is corrected by our Lipschitz Coefficient Penalty (LCP). (a) Visualization of horizontal gated recurrent unit (hGRU) state spaces. Models were trained on Pathfinder-14, and state spaces were visualized by projecting hidden states onto each model's top two eigenvectors. Grey dots are the 2D histogram of projected hidden states, red contours are hidden state densities up to the task-optimized N steps, and blue contours are hidden state densities beyond that point (t>N, for 40 steps). Exemplar dynamics for a single image are plotted in yellow. While the dynamics of the BPTT-trained model diverge when t>N, those of models trained with LCP do not. We refer to the learning algorithms of LCP-trained models as contractor-BPTT (C-BPTT) and contractor-RBP (C-RBP). (b) Model dynamics are reflected in their performance on Pathfinder-14. Segmentations evolve over time, as depicted by the colormap. While the BPTT-trained hGRU is accurate at N steps (red box), it fails when asked to process for longer (t=T=40, blue box). (c) Two-sample KS-tests indicate that the distance in state space between t=N and t=T hidden states is significantly greater for an hGRU trained with BPTT than for an hGRU trained with C-BPTT or C-RBP (n.s. = not significant).

Abstract

There is consensus that recurrent processes support critical visual routines in primate vision, from perceptual grouping to object recognition. These findings are consistent with a growing body of literature suggesting that recurrent connections improve generalization and learning efficiency on classic computer vision challenges. Why, then, are current challenges dominated by feedforward networks? We posit that the effectiveness of recurrent vision models is bottlenecked by the widespread algorithm used for training them, "back-propagation through time" (BPTT), which has O(N) memory complexity for training an N-step model. Because of this, recurrent vision models cannot rival the enormous capacity of leading feedforward networks, nor compensate for this deficit by learning granular and complex visual routines. Here, we develop a new learning algorithm, "contractor recurrent back-propagation" (C-RBP), which achieves constant O(1) memory complexity. We demonstrate that recurrent vision models trained with C-RBP learn long-range spatial dependencies in a synthetic contour tracing task that BPTT-trained models cannot. We further demonstrate that the leading feedforward approach to the large-scale Panoptic Segmentation MS-COCO challenge is improved when augmented with recurrent connections and trained with C-RBP. C-RBP is a general-purpose learning algorithm for any application that can benefit from expansive recurrent dynamics.
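The memory argument above can be illustrated with a minimal sketch of the classic recurrent back-propagation idea on a one-unit toy model. This is not the authors' implementation: it assumes a scalar update h <- tanh(w * h + x) that is contractive (|w| < 1), and the names step, fixed_point, rbp_grad, and fd_grad are hypothetical. The gradient is computed from the final state alone, so no intermediate states are stored, whereas BPTT would keep one state per step.

```python
import math

def step(h, x, w):
    """One recurrent update of the toy model: h <- tanh(w * h + x)."""
    return math.tanh(w * h + x)

def fixed_point(x, w, n_steps=200):
    """Iterate the update, keeping only the latest state (O(1) memory)."""
    h = 0.0
    for _ in range(n_steps):
        h = step(h, x, w)
    return h

def rbp_grad(x, w, y, n_neumann=50):
    """Gradient of 0.5 * (h* - y)^2 w.r.t. w, using only the fixed point h*."""
    h = fixed_point(x, w)
    d = 1.0 - math.tanh(w * h + x) ** 2  # tanh' evaluated at the fixed point
    j_h = w * d                          # dF/dh at h*
    j_w = h * d                          # dF/dw at h*
    # Neumann series: sum_k j_h^k approximates 1 / (1 - j_h) when |j_h| < 1,
    # which is why a contractive update is assumed here.
    inv, term = 0.0, 1.0
    for _ in range(n_neumann):
        inv += term
        term *= j_h
    return (h - y) * inv * j_w

def fd_grad(x, w, y, eps=1e-6):
    """Central finite-difference check of the same gradient."""
    def loss(w_):
        return 0.5 * (fixed_point(x, w_) - y) ** 2
    return (loss(w + eps) - loss(w - eps)) / (2 * eps)

if __name__ == "__main__":
    # Illustrative values only; the two estimates should agree closely.
    print(rbp_grad(0.3, 0.5, 1.0), fd_grad(0.3, 0.5, 1.0))
```

The Neumann series only converges when the update is contractive, which is one way to see why a stability constraint and a constant-memory gradient fit together.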

Contributions

(1) We derive a constraint for training recurrent CNNs that are both stable and expressive. We refer to this as the Lipschitz Coefficient Penalty (LCP).

(2) We combine LCP with RBP to introduce "contractor-RBP" (C-RBP), a learning algorithm for recurrent CNNs with constant memory complexity w.r.t. steps of processing.

(3) Recurrent CNNs trained with C-RBP learn difficult versions of Pathfinder that BPTT-trained models cannot due to memory constraints, generalize better to out-of-distribution exemplars, and need a fraction of the parameters of BPTT-trained models to reach high performance.

Figure 2: Enforcing contraction in recurrent CNNs improves their performance and parameter efficiency, and enables our constant-memory C-RBP learning algorithm. (a) hGRU models were trained and tested on different versions of Pathfinder. Only the version trained with C-RBP for 20 steps maintained high performance across the three datasets. (b) C-RBP models can rely on recurrent processing rather than spatially broad kernels to solve long-range spatial dependencies. BPTT-trained models cannot practically do this due to their linear memory complexity. (c) LCP improves the stability of hGRU dynamics and, as a result, the generalization of learned visual routines for contour integration. Models were trained on Pathfinder-14 and tested on all three Pathfinder datasets. hGRUs trained with C-RBP and C-BPTT generalized far better than a version trained with BPTT or a 6-layer CNN control. Numbers above each curve denote the max-performing step.

(4) C-RBP alleviates the memory bottleneck faced by recurrent CNNs on large-scale computer vision. Our C-RBP trained recurrent model outperforms the leading feedforward approach to the MS-COCO Panoptic Segmentation challenge despite using nearly 800K fewer parameters, and without exceeding the memory capacity of a standard NVIDIA Titan X GPU.

Figure 3: C-RBP trained recurrent vision models outperform the feedforward standard on MS-COCO Panoptic Segmentation despite using nearly 800K fewer parameters. (a) Performance of our recurrent FPN-ResNet 50 trained with C-RBP improves when trained with more steps of processing, despite remaining constant in its memory footprint. (b) Recurrent processing refines instance segmentations and controls false detections of the standard feedforward architecture (additional examples in SI). (c) Panoptic segmentation timecourses for an FPN-ResNet 50 trained with C-RBP for 20 steps.
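The stability constraint named in contribution (1) can be pictured with a small sketch of a Jacobian-based penalty. This is an illustration under stated assumptions, not the paper's exact formulation: it assumes a toy update F(h) = tanh(W h + x), uses a hinge on the column sums of the Jacobian as a surrogate for its size, and the threshold lam = 0.9 and the names jacobian and lipschitz_penalty are hypothetical. The penalty is zero when every column sum stays below the threshold and grows as the update becomes more expansive.

```python
import math

def jacobian(W, h, x):
    """Jacobian of F(h) = tanh(W h + x): J[i][j] = (1 - tanh(pre_i)^2) * W[i][j]."""
    n = len(h)
    pre = [sum(W[i][j] * h[j] for j in range(n)) + x[i] for i in range(n)]
    return [[(1.0 - math.tanh(pre[i]) ** 2) * W[i][j] for j in range(n)]
            for i in range(n)]

def lipschitz_penalty(W, h, x, lam=0.9):
    """L2 norm of the hinge max(colsum(J) - lam, 0); lam is an assumed threshold."""
    J = jacobian(W, h, x)
    n = len(h)
    # Column sums are the product of a row vector of ones with the Jacobian.
    col_sums = [sum(J[i][j] for i in range(n)) for j in range(n)]
    return math.sqrt(sum(max(s - lam, 0.0) ** 2 for s in col_sums))

if __name__ == "__main__":
    h0, x0 = [0.0, 0.0], [0.0, 0.0]
    # Small weights: column sums fall below lam, so nothing is penalized.
    print(lipschitz_penalty([[0.2, 0.1], [0.1, 0.2]], h0, x0))
    # Large weights: the update is expansive near h0 and is penalized.
    print(lipschitz_penalty([[2.0, 0.0], [0.0, 2.0]], h0, x0))
```

In a training loop, a term like this would be added to the task loss and evaluated at the model's final hidden state; a ones-vector product is convenient because it needs a single vector-Jacobian product rather than the full Jacobian.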