Author:
Venkataramanan Revathy, Padhee Swati, Rao Saini Rohan, Kaoshik Ronak, Sundara Rajan Anirudh, Sheth Amit
Abstract
Cross-modal recipe retrieval has gained prominence because it can retrieve a text representation given an image representation and vice versa. Clustering recipe representations by similarity is essential for retrieving relevant information about unknown food images. Existing studies cluster similar recipe representations in the latent space based on class names. Because of inter-class similarity and intra-class variation, associating a recipe with a class name does not provide enough knowledge about the recipe to determine similarity. Recipe title, ingredients, and cooking actions, however, provide detailed knowledge about recipes and are a better determinant of similar recipes. In this study, we use this additional knowledge, specifically ingredients and recipe title, to identify similar recipes, paying particular attention to rare ingredients. To incorporate this knowledge, we propose a knowledge-infused multimodal cooking representation learning network, Ki-Cook, built on the procedural attribute of the cooking process. To the best of our knowledge, this is the first study to adopt a comprehensive recipe similarity determinant to identify and cluster similar recipe representations. The proposed network also incorporates ingredient images to learn multimodal cooking representations. Since the motivation for clustering similar recipes is to retrieve relevant information for an unknown food image, we evaluated the model on the ingredient retrieval task. Our empirical analysis shows that the proposed model improves the Coverage of Ground Truth by 12% and the Intersection Over Union by 10% compared to the baseline models. On average, the representations learned by our model contain an additional 15.33% of rare ingredients compared to the baseline models. Owing to this difference, our qualitative evaluation shows a 39% improvement in clustering similar recipes in the latent space compared to the baseline models, with an inter-annotator agreement (Fleiss' kappa) of 0.35.
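To make the evaluation measures named in the abstract concrete, the following is a minimal sketch of how they might be computed for a single retrieval result. The abstract does not define the metrics, so the definitions here are assumptions: Intersection Over Union is taken as the standard set overlap ratio, Coverage of Ground Truth is assumed to be the fraction of ground-truth ingredients that appear in the retrieved set, and Fleiss' kappa follows its textbook formula. The function names and the sample ingredient sets are hypothetical and not taken from the paper.

```python
def intersection_over_union(retrieved, ground_truth):
    """Standard set IoU: |A ∩ B| / |A ∪ B| (assumed to match the paper's metric)."""
    retrieved, ground_truth = set(retrieved), set(ground_truth)
    union = retrieved | ground_truth
    if not union:
        return 0.0
    return len(retrieved & ground_truth) / len(union)


def coverage_of_ground_truth(retrieved, ground_truth):
    """Assumed definition: share of ground-truth ingredients that were retrieved."""
    retrieved, ground_truth = set(retrieved), set(ground_truth)
    if not ground_truth:
        return 0.0
    return len(retrieved & ground_truth) / len(ground_truth)


def fleiss_kappa(counts):
    """Textbook Fleiss' kappa.

    counts[i][j] is the number of annotators who put item i in category j.
    Every item must be rated by the same number of annotators (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    total = n_items * n_raters

    # Proportion of all assignments that fall in each category.
    p_j = [sum(row[j] for row in counts) / total for j in range(n_categories)]
    # Per-item agreement among annotator pairs.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items          # observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance
    if p_e == 1.0:
        return float("nan")             # undefined when only one category is ever used
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical example data, for illustration only.
retrieved = {"flour", "sugar", "butter", "saffron"}
ground_truth = {"flour", "sugar", "egg", "saffron", "milk"}

iou = intersection_over_union(retrieved, ground_truth)
cvg = coverage_of_ground_truth(retrieved, ground_truth)
kappa = fleiss_kappa([[3, 0], [0, 3]])  # three annotators, two items, full agreement
```

Under these assumed definitions, coverage only penalizes missed ground-truth ingredients, while Intersection Over Union also penalizes retrieved ingredients that are not in the ground truth, which is why the two are reported side by side.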
Subject
Artificial Intelligence, Information Systems, Computer Science (miscellaneous)