Transformers in Vision: A Survey

Published:2022-01-31 Issue:10s Volume:54 Page:1-41
ISSN:0360-0300
Container-title:ACM Computing Surveys
language:en
Short-container-title:ACM Comput. Surv.

Author:

Khan Salman¹, Naseer Muzammal¹, Hayat Munawar², Zamir Syed Waqas³, Khan Fahad Shahbaz⁴, Shah Mubarak⁵

Affiliation:

1. MBZUAI, UAE and Australian National University, Canberra, ACT, AU

2. Department of DSAI, Faculty of IT, Monash University, Clayton, Victoria, AU

3. Inception Institute of Artificial Intelligence, Masdar City, Abu Dhabi, UAE

4. MBZUAI, UAE and CVL, Linköping University, Linköping, Sweden

5. CRCV, University of Central Florida, Orlando, FL, USA

Abstract

Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences, in contrast to recurrent networks such as Long Short-Term Memory (LSTM). Unlike convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text, and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to fundamental concepts behind the success of Transformers, i.e., self-attention, large-scale pre-training, and bidirectional feature encoding. We then cover extensive applications of Transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization), and three-dimensional analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis of open research directions and possible future work. We hope this effort will ignite further interest in the community to solve current challenges toward the application of Transformer models in computer vision.

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science, Theoretical Computer Science

Link

https://dl.acm.org/doi/pdf/10.1145/3505244

References: 256 articles.

1. https://www.youtube.com/watch?v=UX8OubxsY8w AAAI 2020 Keynotes Turing Award Winners Event

2. https://lambdalabs.com/blog/demystifying-gpt-3/ OpenAI’s GPT-3 Language Model: A Technical Overview

3. https://ai.googleblog.com/2017/07/revisiting-unreasonable-effectiveness.html Revisiting the Unreasonable Effectiveness of Data

4. Quantifying attention flow in transformers; Abnar Samira; arXiv:2005.00928, 2020

5. Unaiza Ahsan, Rishi Madhok, and Irfan Essa. 2019. Video jigsaw: Unsupervised learning of spatiotemporal context for video action recognition. In WACV.

Cited by 706 articles.

1. Applying deep learning to real-time UAV-based forest monitoring: Leveraging multi-sensor imagery for improved results;Expert Systems with Applications;2024-07

2. A comprehensive survey on applications of transformers for deep learning tasks;Expert Systems with Applications;2024-05

3. Learning multiple attention transformer super-resolution method for grape disease recognition;Expert Systems with Applications;2024-05

4. Vision transformer: To discover the “four secrets” of image patches;Information Fusion;2024-05

5. A survey of multimodal hybrid deep learning for computer vision: Architectures, applications, trends, and challenges;Information Fusion;2024-05