FoodBase corpus: a new resource of annotated food entities

Author:

Popovski Gorjan (1, 2, 3), Seljak Barbara Koroušić (3), Eftimov Tome (3, 4, 5)

Affiliation:

1. Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, ul. Rudzer Boshkovikj 16, 1000 Skopje, Macedonia

2. Jožef Stefan International Postgraduate School, Jamova cesta 39, 1000 Ljubljana, Slovenia

3. Computer Systems Department, Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenia

4. Department of Biomedical Data Science, Stanford University, 450 Serra Mall, Stanford, CA 94305, USA

5. Center for Population Health Sciences, Stanford University, 450 Serra Mall, Stanford, CA 94305, USA

Abstract

The existence of annotated text corpora is essential for the development of public health services and tools based on natural language processing (NLP) and text mining. Recently organized biomedical NLP shared tasks have provided annotated corpora related to different biomedical entities such as genes, phenotypes, drugs, diseases and chemical entities. These are needed to develop named-entity recognition (NER) models that are used for extracting entities from text and finding their relations. However, to the best of our knowledge, there are few annotated corpora that provide information about food entities, despite food and dietary management being an essential public health issue. Hence, we developed a new annotated corpus of food entities, named FoodBase. It was constructed using recipes extracted from Allrecipes, which is currently the largest food-focused social network. The recipes were selected from five categories: ‘Appetizers and Snacks’, ‘Breakfast and Lunch’, ‘Dessert’, ‘Dinner’ and ‘Drinks’. Semantic tags used for annotating food entities were selected from the Hansard corpus. To extract and annotate food entities, we applied a rule-based food NER method called FoodIE. Since FoodIE provides a weakly annotated corpus, we created a gold standard of FoodBase by manually evaluating the results obtained on 1000 recipes. It consists of 12 844 food entity annotations describing 2105 unique food entities. Additionally, we provide a weakly annotated corpus of a further 21 790 recipes. It consists of 274 053 food entity annotations, 13 079 of which are unique. The FoodBase corpus is necessary for developing corpus-based NER models for food science and can serve as a new benchmark dataset for machine learning tasks such as multi-class classification, multi-label classification and hierarchical multi-label classification. FoodBase can also be used for detecting semantic differences and similarities between food concepts, and we believe that it will open a new path for learning a food embedding space that can be used in predictive studies.
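
The abstract reports two figures per corpus: the total number of food entity annotations and the number of unique food entities. As a rough illustration of how such figures could be derived, here is a minimal Python sketch. The record layout, the field names (`annotations`, `text`), the sample recipes and the case-insensitive notion of uniqueness are all assumptions made for this example; they are not the FoodBase file format or the authors' counting procedure.

```python
from collections import Counter

# Hypothetical in-memory layout (an assumption, not the FoodBase schema):
# each recipe carries a list of annotated food entity mentions.
recipes = [
    {"id": "r1", "category": "Drinks",
     "annotations": [{"text": "lemon juice"}, {"text": "sugar"}, {"text": "Sugar"}]},
    {"id": "r2", "category": "Dessert",
     "annotations": [{"text": "sugar"}, {"text": "butter"}]},
]


def annotation_stats(recipes):
    """Return (total annotations, number of unique entities).

    Mentions are compared case-insensitively after trimming whitespace;
    this normalization is an assumption made for the sketch.
    """
    counts = Counter(
        ann["text"].strip().lower()
        for recipe in recipes
        for ann in recipe["annotations"]
    )
    return sum(counts.values()), len(counts)


total, unique = annotation_stats(recipes)
print(total, unique)
```

A stricter or looser definition of uniqueness (for example, lemmatizing mentions or keeping case) would change the second figure, so the normalization step is the part to adapt to the corpus actually in use.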

Funder

Slovenian Research Agency

European Union’s Horizon 2020

Publisher

Oxford University Press (OUP)

Subject

General Agricultural and Biological Sciences; General Biochemistry, Genetics and Molecular Biology; Information Systems
