Fashion-Oriented Image Captioning with External Knowledge Retrieval and Fully Attentive Gates
Author:
Nicholas Moratelli 1, Manuele Barraco 1, Davide Morelli 1, Marcella Cornia 2, Lorenzo Baraldi 1, Rita Cucchiara 1
Affiliation:
1. Department of Engineering “Enzo Ferrari”, University of Modena and Reggio Emilia, 41125 Modena, Italy
2. Department of Education and Humanities, University of Modena and Reggio Emilia, 42121 Reggio Emilia, Italy
Abstract
Research related to the fashion and e-commerce domains is gaining attention in the computer vision and multimedia communities. Following this trend, this article tackles the task of generating fine-grained and accurate natural language descriptions of fashion items, a recently proposed and under-explored challenge that is still far from being solved. To overcome the limitations of previous approaches, a transformer-based captioning model was designed that integrates an external textual memory accessed through k-nearest-neighbor (kNN) searches. From an architectural point of view, the proposed transformer model can read and retrieve items from the external memory through cross-attention operations, and can tune the flow of information coming from the external memory thanks to a novel fully attentive gate. Experimental analyses were carried out on the fashion captioning dataset (FACAD), which contains more than 130k fine-grained descriptions, validating the effectiveness of the proposed approach and of the proposed architectural strategies in comparison with carefully designed baselines and state-of-the-art approaches. The presented method consistently outperforms all compared approaches, demonstrating its effectiveness for fashion image captioning.
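Since the abstract only sketches the mechanism, the following minimal PyTorch sketch illustrates how a single caption-decoder layer with a kNN-retrieved textual memory and a gated memory read could be wired up. Everything here is an assumption for illustration: the module names, the dimensions, and in particular the gating formula (a per-token, per-channel sigmoid over the concatenated visual and memory reads) are simplifications, not the paper's actual fully attentive gate.

```python
# Illustrative sketch only: the real model's retrieval, encoding, and gating
# details are defined in the paper; this shows the general wiring.
import torch
import torch.nn as nn


class GatedMemoryDecoderLayer(nn.Module):
    """One caption-decoder layer that reads both image regions and
    kNN-retrieved textual memory items, with a gate on the memory read."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.vis_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mem_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Assumed gate: per-token, per-channel sigmoid weights computed from
        # the concatenated visual and memory cross-attention outputs.
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, tokens, visual, memory):
        # tokens: (B, T, D) partial caption; visual: (B, R, D) image features;
        # memory: (B, K, D) encoded k-nearest-neighbor textual memory items.
        T = tokens.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool,
                                       device=tokens.device), diagonal=1)
        x = self.norm1(tokens + self.self_attn(tokens, tokens, tokens,
                                               attn_mask=causal)[0])
        v = self.vis_attn(x, visual, visual)[0]   # read the image
        m = self.mem_attn(x, memory, memory)[0]   # read retrieved descriptions
        g = self.gate(torch.cat([v, m], dim=-1))  # how much memory to let through
        x = self.norm2(x + v + g * m)
        return self.norm3(x + self.ff(x))


# Toy usage: batch of 2 captions, 49 image regions, 5 retrieved memory items.
layer = GatedMemoryDecoderLayer()
out = layer(torch.randn(2, 12, 512), torch.randn(2, 49, 512), torch.randn(2, 5, 512))
print(out.shape)  # torch.Size([2, 12, 512])
```

The point of gating the memory read, rather than simply summing it in, is that retrieved neighbors may be irrelevant for a given token, so the model can learn to suppress them per token and per channel; the paper's fully attentive gate realizes this idea with attention operations rather than the simple MLP gate used above.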
Funder
PRIN project “CREATIVE: CRoss-modal understanding and gEnerATIon of Visual and tExtual content”, Italian Ministry of University and Research; YOOX-NET-A-PORTER Group
Subject
Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry
Cited by
10 articles.
1. Image captioning by diffusion models: A survey; Engineering Applications of Artificial Intelligence; 2024-12
2. Towards Attribute-Controlled Fashion Image Captioning; ACM Transactions on Multimedia Computing, Communications, and Applications; 2024-06-05
3. TSIC-CLIP: Traffic Scene Image Captioning Model Based on Clip; Information Technology and Control; 2024-03-22
4. FashionVLM - Fashion Captioning Using Pretrained Vision Transformer and Large Language Model; 2024 International Conference on Emerging Smart Computing and Informatics (ESCI); 2024-03-05
5. Large Language Models versus Natural Language Understanding and Generation; Proceedings of the 27th Pan-Hellenic Conference on Progress in Computing and Informatics; 2023-11-24