Efficient Content-Based Sparse Attention with Routing Transformers

Published: 2021-02 Volume: 9 Page: 53-68
ISSN:2307-387X
Container-title:Transactions of the Association for Computational Linguistics
language:en

Author:

Roy Aurko¹, Saffar Mohammad¹, Vaswani Ashish¹, Grangier David¹

Affiliation:

1. Google Research.

Abstract

Self-attention has recently been adopted for a wide range of sequence modeling problems. Despite its effectiveness, self-attention suffers from quadratic computation and memory requirements with respect to sequence length. Successful approaches to reduce this complexity focused on attending to local sliding windows or a small set of locations independent of content. Our work proposes to learn dynamic sparse attention patterns that avoid allocating computation and memory to attend to content unrelated to the query of interest. This work builds upon two lines of research: It combines the modeling flexibility of prior work on content-based sparse attention with the efficiency gains from approaches based on local, temporal sparse attention. Our model, the Routing Transformer, endows self-attention with a sparse routing module based on online k-means while reducing the overall complexity of attention to $O(n^{1.5}d)$ from $O(n^{2}d)$ for sequence length $n$ and hidden dimension $d$. We show that our model outperforms comparable sparse attention models on language modeling on Wikitext-103 (15.8 vs 18.3 perplexity), as well as on image generation on ImageNet-64 (3.43 vs 3.44 bits/dim) while using fewer self-attention layers. Additionally, we set a new state-of-the-art on the newly released PG-19 data-set, obtaining a test perplexity of 33.2 with a 22 layer Routing Transformer model trained on sequences of length 8192. We open-source the code for Routing Transformer in Tensorflow.
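The complexity figures in the abstract can be checked with a back-of-the-envelope count. The sketch below is not the authors' code. It assumes a simple cost model in which each token is compared against k centroids (n·k·d operations) and then attends only within its own cluster, with all k clusters holding n/k tokens each (n²·d/k operations in total). The function names and the value d = 64 are illustrative assumptions; n = 8192 is the sequence length quoted in the abstract. Under this model, choosing k near √n balances the two terms and gives a total of roughly 2·n^1.5·d.

```python
import math


def dense_attention_cost(n: int, d: int) -> int:
    """Multiply-add count for full self-attention: every query scores every key."""
    return n * n * d


def routed_attention_cost(n: int, d: int, k: int) -> float:
    """Assumed cost model for cluster-routed attention with k equal-sized clusters."""
    assignment = n * k * d         # compare each token with each of the k centroids
    within = k * (n / k) ** 2 * d  # dense attention inside each cluster of n/k tokens
    return assignment + within


n, d = 8192, 64                    # d = 64 is an illustrative assumption
k = round(math.sqrt(n))            # k close to sqrt(n) balances the two terms

dense = dense_attention_cost(n, d)
routed = routed_attention_cost(n, d, k)
print(f"k = {k}, dense/routed cost ratio = {dense / routed:.1f}")
```

Real clusters are rarely perfectly balanced, so this count should be read as an idealized estimate of the asymptotic saving rather than a measured speedup.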

Publisher

MIT Press - Journals

Subject

Artificial Intelligence, Computer Science Applications, Linguistics and Language, Human-Computer Interaction, Communication

Link

https://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00353

References: 55 articles.

Cited by 183 articles.

1. Learning to match features with discriminative sparse graph neural network;Pattern Recognition;2024-12

2. When Transformer Meets Large Graphs: An Expressive and Efficient Two-View Architecture;IEEE Transactions on Knowledge and Data Engineering;2024-10

3. Hardware–Software Co-Design Enabling Static and Dynamic Sparse Attention Mechanisms;IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems;2024-09

4. CA-Captioner: A novel concentrated attention for image captioning;Expert Systems with Applications;2024-09

5. Inductive Modeling for Realtime Cold Start Recommendations;Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining;2024-08-24