Multi-Dimensional Fusion Attention Mechanism with Vim-like Structure for Mobile Network Design
Published: 2024-07-31
Issue: 15
Volume: 14
Page: 6670
ISSN: 2076-3417
Container-title: Applied Sciences
Language: en
Short-container-title: Applied Sciences
Author:
Shi Jialiang 1,2, Zhou Rigui 1,2, Ren Pengju 1,2, Long Zhengyu 1,2
Affiliation:
1. College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
2. Research Center of Intelligent Information Processing and Quantum Intelligent Computing, Shanghai 201306, China
Abstract
Recent advancements in mobile neural networks, such as the squeeze-and-excitation (SE) attention mechanism, have significantly improved model performance. However, they often overlook the crucial interaction between location information and channels. The interaction of multiple dimensions in feature engineering is of paramount importance for achieving high-quality results. The Transformer model and its successors, such as Mamba and Vision Mamba, effectively combine features and link them to location information, an approach that has spread from natural language processing (NLP) to computer vision (CV). This paper introduces a novel attention mechanism for mobile neural networks inspired by the structure of Vim (Vision Mamba). It adopts a “1 + 3” architecture to embed multi-dimensional information into channel attention, termed the “Multi-Dimensional Vim-like Attention Mechanism”. The proposed method splits the input into two major branches: the left branch retains the original information for subsequent feature screening, while the right branch divides the channel attention into three one-dimensional feature-encoding processes. These processes aggregate features along one channel direction and two spatial directions, simultaneously capturing long-range dependencies and preserving precise location information. The resulting feature maps are then combined with the left branch to produce direction-aware, location-sensitive, and channel-aware attention maps. The multi-dimensional Vim-like attention module is simple and can be seamlessly integrated into classical mobile neural networks such as MobileNetV2 and ShuffleNetV2 with minimal computational overhead. Experimental results demonstrate that the attention module adapts well to mobile neural networks with a low parameter count, delivering excellent performance on the CIFAR-100 and MS COCO datasets.
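As a rough illustration of the “1 + 3” design described in the abstract, the Python (PyTorch) sketch below implements one plausible reading of it: a left branch that keeps the input intact, two one-dimensional spatial encodings in the spirit of coordinate attention, and one channel-direction gate in the spirit of SE. All names here (MultiDimVimAttention, reduction, mid) are illustrative assumptions, not the authors' published code.

# Minimal sketch of a "1 + 3" multi-dimensional attention module.
# Assumptions: module/parameter names and the exact encoding layers are
# hypothetical; only the branch structure follows the abstract.
import torch
import torch.nn as nn

class MultiDimVimAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        # Shared 1x1 bottleneck for the two spatial (H/W) encodings.
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)
        # Channel-direction encoding: an SE-style bottleneck.
        self.fc = nn.Sequential(
            nn.Linear(channels, mid), nn.ReLU(inplace=True),
            nn.Linear(mid, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x  # left branch: original features kept for screening
        n, c, h, w = x.shape
        # Right branch, spatial directions: pool along W and along H, so each
        # 1-D encoding preserves exact positions along the other axis.
        x_h = x.mean(dim=3, keepdim=True)                       # (n, c, h, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # (n, c, w, 1)
        y = self.act(self.bn1(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (n, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (n, c, 1, w)
        # Right branch, channel direction: one global gate over all channels.
        a_c = torch.sigmoid(self.fc(x.mean(dim=(2, 3)))).view(n, c, 1, 1)
        # Fuse the three 1-D attention maps with the untouched left branch.
        return identity * a_h * a_w * a_c

In a mobile backbone, such a module would typically sit after the depthwise convolution of a MobileNetV2 inverted-residual block or at the end of a ShuffleNetV2 unit; it adds only 1x1 convolutions and two small fully connected layers, consistent with the minimal overhead the abstract reports.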
Funder
National Key R&D Plan