Lightweight Food Recognition via Aggregation Block and Feature Encoding

Published: 2024-07-22
ISSN: 1551-6857
Journal: ACM Transactions on Multimedia Computing, Communications, and Applications
Language: English
Abbreviated journal title: ACM Trans. Multimedia Comput. Commun. Appl.

Authors:

Yang Yancun¹, Min Weiqing², Song Jingru¹, Sheng Guorui¹, Wang Lili¹, Jiang Shuqiang²

Affiliation:

1. School of Information and Electrical Engineering, Ludong University, China

2. Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, China

Abstract

Food image recognition has recently received considerable attention in the multimedia field because of its potential implications for health. Because ingredients are dispersed throughout food images, recognition places higher demands on a neural network's ability to extract long-range information, which has led to more complex and deeper models. Nevertheless, lightweight food image recognition is essential for deployment on end devices and for sustainable server-side scaling. To address this issue, we present Aggregation Feature Net (AFNet), a lightweight network that effectively captures both global and local features from food images. In AFNet, we develop a novel residual-based convolution, the aggregation block, which encodes global features by integrating information row-wise and column-wise. Combining the aggregation block with classic local convolution yields the backbone of the network. Building on the parameter efficiency of the aggregation block and assisted by a new type of activation function, we construct a lightweight food image recognition network with fewer layers and a smaller scale. Experimental results on four popular food recognition datasets show that our approach achieves state-of-the-art performance, with higher accuracy and fewer FLOPs and parameters. For example, compared with the current state-of-the-art model, MobileViTv2, AFNet achieves 88.4% top-1 accuracy on the ETHZ Food-101 dataset, 1.4% higher, with a similar number of parameters and FLOPs. The source code will be provided in the supplementary materials.
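The abstract does not spell out the layers inside the aggregation block. As a rough illustration of what "encoding global features through row-wise and column-wise information integration" with a residual connection can mean, the sketch below averages each row and each column of a single-channel feature map and adds that context back onto the input. The function name `row_col_aggregate`, the use of plain means instead of learned weights, and the single-channel setting are all assumptions made for illustration; this is not AFNet's actual block.

```python
from typing import List

Matrix = List[List[float]]


def row_col_aggregate(x: Matrix) -> Matrix:
    """Hypothetical row/column aggregation with a residual connection.

    Illustrative assumption only, not the AFNet aggregation block:
    each output position (i, j) receives its own value plus the mean of
    row i and the mean of column j, so it sees context from its entire
    row and column in a single step.
    """
    h, w = len(x), len(x[0])
    # Row-wise integration: one summary value per row.
    row_means = [sum(row) / w for row in x]
    # Column-wise integration: one summary value per column.
    col_means = [sum(x[i][j] for i in range(h)) / h for j in range(w)]
    # Residual connection: add the broadcast row and column context to the input.
    return [
        [x[i][j] + row_means[i] + col_means[j] for j in range(w)]
        for i in range(h)
    ]


if __name__ == "__main__":
    feature_map = [[1.0, 2.0], [3.0, 4.0]]
    print(row_col_aggregate(feature_map))  # [[4.5, 6.5], [8.5, 10.5]]
```

One such step only mixes information along a position's own row and column. Stacking two of them already gives every position a path to every other position in the map, because the second step reads rows whose entries each summarize a full column. This is one way a shallow network could obtain a global receptive field at a cost that grows with H + W reductions per position rather than with the full H × W map.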

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3680285
