A Lightweight Hybrid Model with Location-Preserving ViT for Efficient Food Recognition
Published: 2024-01-08
Issue: 2
Volume: 16
Page: 200
ISSN: 2072-6643
Container-title: Nutrients
Short-container-title: Nutrients
Language: en
Authors:
Sheng Guorui (1), Min Weiqing (2,3), Zhu Xiangyi (1), Xu Liang (1), Sun Qingshuo (1), Yang Yancun (1), Wang Lili (1), Jiang Shuqiang (2,3)
Affiliations:
1. School of Information and Electrical Engineering, Ludong University, Yantai 264025, China
2. Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
3. School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100190, China
Abstract
Food-image recognition plays a pivotal role in intelligent nutrition management, and lightweight recognition methods based on deep learning are crucial for enabling deployment on mobile devices. This capability empowers individuals to manage their daily diet and nutrition with devices such as smartphones. In this study, we propose the Efficient Hybrid Food Recognition Net (EHFR-Net), a novel neural network that integrates Convolutional Neural Networks (CNNs) and the Vision Transformer (ViT). We find that in food-image recognition tasks, although ViT excels at extracting global information, its disregard for the initial spatial information hampers its efficacy. We therefore design a ViT variant termed the Location-Preserving Vision Transformer (LP-ViT), which retains positional information during global information extraction. To keep the model lightweight, we employ an inverted residual block on the CNN side to extract local features. Global and local features are then integrated by summing and concatenating the outputs of the convolutional and ViT structures, yielding a unified Hybrid Block (HBlock). Moreover, we optimize the hierarchical layout of EHFR-Net to match the characteristics of the HBlock, further reducing the model size. Extensive experiments on three well-known food-image recognition datasets demonstrate the superiority of our approach. On the ETHZ Food-101 dataset, for instance, our method achieves a recognition accuracy of 90.7%, 3.5 percentage points higher than MobileViTv2 (87.2%), the state-of-the-art ViT-based lightweight network with an equivalent number of parameters and computations.
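The abstract only sketches the architecture, so the PyTorch snippet below is a minimal illustrative sketch of the fusion idea: a MobileNetV2-style inverted residual block stands in for the local (CNN) branch, and plain multi-head self-attention that reshapes its output back to the 2D feature map stands in for LP-ViT. All class names, channel sizes, and fusion details here are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch of the HBlock fusion described in the abstract.
# InvertedResidual, GlobalBranch, and HBlock are hypothetical names;
# the paper's LP-ViT internals are not reproduced here.
import torch
import torch.nn as nn


class InvertedResidual(nn.Module):
    """MobileNetV2-style inverted residual block (local-feature branch)."""

    def __init__(self, channels: int, expansion: int = 4):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),  # 1x1 expand
            nn.BatchNorm2d(hidden),
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden,
                      bias=False),                       # 3x3 depthwise
            nn.BatchNorm2d(hidden),
            nn.SiLU(),
            nn.Conv2d(hidden, channels, 1, bias=False),  # 1x1 project
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)  # residual connection


class GlobalBranch(nn.Module):
    """Stand-in for LP-ViT: self-attention over spatial positions that
    restores the 2D layout afterwards, approximating 'location preserving'."""

    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)            # (B, H*W, C)
        y, _ = self.attn(*[self.norm(tokens)] * 3)       # self-attention
        return (tokens + y).transpose(1, 2).reshape(b, c, h, w)


class HBlock(nn.Module):
    """Hybrid block: sum the local and global features, then concatenate
    with the input and project back with a 1x1 convolution."""

    def __init__(self, channels: int):
        super().__init__()
        self.local = InvertedResidual(channels)
        self.global_ = GlobalBranch(channels)
        self.fuse = nn.Conv2d(2 * channels, channels, 1, bias=False)

    def forward(self, x):
        summed = self.local(x) + self.global_(x)           # element-wise sum
        return self.fuse(torch.cat([x, summed], dim=1))    # concat + project


if __name__ == "__main__":
    block = HBlock(64)
    out = block(torch.randn(2, 64, 32, 32))
    print(out.shape)  # torch.Size([2, 64, 32, 32])
```

Reshaping the attention output back to (B, C, H, W) is how this sketch approximates the location-preserving behavior the abstract attributes to LP-ViT, and the sum-then-concatenate fusion mirrors its description of the HBlock.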
Cited by 2 articles:
1. Lightweight Food Recognition via Aggregation Block and Feature Encoding; ACM Transactions on Multimedia Computing, Communications, and Applications; 2024-07-22
2. Food Computing for Nutrition and Health; 2024 IEEE 40th International Conference on Data Engineering Workshops (ICDEW); 2024-05-13