Generation of training data for named entity recognition of artworks

Published: 2022-12-15 Issue: 2 Volume: 14 Pages: 239-260
ISSN: 2210-4968
Container-title: Semantic Web
Short-container-title: SW

Author:

Jain Nitisha¹, Sierra-Múnera Alejandro¹, Ehmueller Jan¹, Krestel Ralf¹

Affiliation:

1. Hasso Plattner Institute, University of Potsdam, Germany

Abstract

As machine learning techniques are increasingly employed for text processing tasks, the need for training data has become a major bottleneck for their application. Manually generating large-scale training datasets tailored to each task is time-consuming and expensive, which makes automated generation necessary. In this work, we turn our attention to the creation of training datasets for named entity recognition (NER) in the cultural heritage domain. NER plays an important role in many natural language processing systems. Most NER systems are limited to a few common named entity types, such as person, location, and organization. However, for cultural heritage resources, such as digitized art archives, recognizing fine-grained entity types such as titles of artworks is of high importance. Current state-of-the-art tools are unable to adequately identify artwork titles because relevant training datasets are unavailable. We analyse the particular difficulties presented by this domain and motivate the need for quality annotations to train machine learning models for identifying artwork titles. We present a framework with a heuristic-based approach that creates high-quality training data by leveraging existing cultural heritage resources from knowledge bases such as Wikidata. Experimental evaluation shows a significant improvement over the baseline in NER performance for artwork titles when models are trained on the dataset generated using our framework.
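The general idea of producing NER annotations from a list of known titles can be illustrated with a minimal sketch. This is not the authors' framework, whose heuristics are not described in the abstract; it is a plain dictionary lookup under stated assumptions. The function names `tokenize` and `bio_tag`, the `ART` label, and the hand-written title list (standing in for titles that would be retrieved from a knowledge base such as Wikidata) are all hypothetical.

```python
import re


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def bio_tag(tokens, titles):
    """Assign B-ART / I-ART / O tags by exact, case-sensitive title lookup.

    Longer titles are tried first so that a long title is preferred
    over a shorter title starting at the same token.
    """
    title_tokens = sorted((tokenize(t) for t in titles), key=len, reverse=True)
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for candidate in title_tokens:
            n = len(candidate)
            if n and tokens[i:i + n] == candidate:
                tags[i] = "B-ART"
                tags[i + 1:i + n] = ["I-ART"] * (n - 1)
                i += n  # continue after the matched title
                break
        else:
            i += 1  # no title starts at this token
    return tags


# Assumption: a hand-written list stands in for knowledge-base titles.
known_titles = ["Girl with a Pearl Earring", "The Night Watch"]
sentence = "Vermeer painted Girl with a Pearl Earring around 1665."
sample_tokens = tokenize(sentence)
sample_tags = bio_tag(sample_tokens, known_titles)
for token, tag in zip(sample_tokens, sample_tags):
    print(f"{token}\t{tag}")
```

Exact matching of this kind misses title variants and mislabels ordinary phrases that happen to coincide with a title, which is one reason a purely dictionary-based annotation is usually only a starting point.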

Publisher

IOS Press

Subject

Computer Networks and Communications, Computer Science Applications, Information Systems

Cited by 1 article.

1. A Joint Entity-Relation Detection and Generalization Method Based on Syntax and Semantics for Chinese Intangible Cultural Heritage Texts; Journal on Computing and Cultural Heritage; 2024-01-13