Comparing Vision Transformers and Convolutional Neural Networks for Image Classification: A Literature Review

Published: 2023-04-28, Volume: 13, Issue: 9, Page: 5521
ISSN: 2076-3417
Journal: Applied Sciences
Language: en

Author:

Maurício José¹, Domingues Inês¹, Bernardino Jorge¹

Affiliation:

1. Polytechnic of Coimbra, Coimbra Institute of Engineering (ISEC), Rua Pedro Nunes, 3030-199 Coimbra, Portugal

Abstract

Transformers are models that implement a self-attention mechanism, individually weighting the importance of each part of the input data. Their use in image classification is still somewhat limited, since researchers have so far favored Convolutional Neural Networks (CNNs) for this task, while transformers have mainly been applied to Natural Language Processing (NLP). This paper therefore presents a literature review that shows the differences between Vision Transformers (ViT) and CNNs. State-of-the-art work that used both architectures for image classification was reviewed, and an attempt was made to understand which factors may influence their performance, based on the datasets used, image size, number of target classes, hardware, evaluated architectures, and top results. The objective of this work is to identify which architecture is better for image classification and under what conditions. The paper also describes the importance of the Multi-Head Attention mechanism for improving the performance of ViT in image classification.
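
As a rough illustration of the self-attention idea mentioned in the abstract, the sketch below computes scaled dot-product attention over a few toy vectors in plain Python. It is not taken from the paper: the function names, the three example "patch" vectors, and the use of identity projections (queries, keys, and values are the inputs themselves) are all assumptions made to keep the example short. A real ViT learns separate projection matrices and runs several such heads in parallel (Multi-Head Attention).

```python
import math

def softmax(row):
    # Subtract the max for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with identity projections.

    Assumption: queries, keys and values are the token vectors themselves;
    a trained model would apply learned projection matrices first.
    """
    d = len(tokens[0])
    output, weights = [], []
    for q in tokens:
        # Similarity of this token to every token, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average of all value vectors.
        output.append([sum(wi * v[j] for wi, v in zip(w, tokens)) for j in range(d)])
    return output, weights

# Three hypothetical 2-dimensional patch embeddings.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = self_attention(patches)
```

Each row of `attn` holds the weights one patch assigns to every patch, which is the "individual weighting of each part of the input" the abstract refers to.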

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science

Link

https://www.mdpi.com/2076-3417/13/9/5521/pdf

References (30 articles)

1. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., and Gelly, S. (2020). An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv.

2. Saha, S. (2023, January 08). A Comprehensive Guide to Convolutional Neural Networks—The ELI5 Way. Available online: https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53.

3. Snyder (2019). Literature Review as a Research Methodology: An Overview and Guidelines. J. Bus. Res.

4. Matloob (2021). Software Defect Prediction Using Ensemble Learning: A Systematic Literature Review. IEEE Access.

5. Benz, P., Ham, S., Zhang, C., Karjauv, A., and Kweon, I.S. (2021). Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs. arXiv.

Cited by 87 articles.

1. Smart and user-centric manufacturing information recommendation using multimodal learning to support human-robot collaboration in mixed reality environments;Robotics and Computer-Integrated Manufacturing;2025-02

2. Automated computer vision based individual salmon (Salmo salar) breathing rate estimation (SaBRE) for improved state observability;Aquaculture;2025-01

3. Cross-modal domain generalization semantic segmentation based on fusion features;Knowledge-Based Systems;2024-10

4. 3D feature characterization of flotation froth based on a dual-attention encoding volume stereo matching model and binocular stereo vision extraction;Minerals Engineering;2024-10

5. Vision transformer promotes cancer diagnosis: A comprehensive review;Expert Systems with Applications;2024-10