Automatic Generation of Semantically Annotated Collocation Corpus

Published: 2023-11 Issue: 11 Page: 113-125
ISSN:2409-8698
Container-title:Litera
language:en

Author:

Zaripova Diana Aleksandrovna, Lukashevich Natal'ya Valentinovna

Abstract

Word Sense Disambiguation (WSD) is a crucial initial step in automatic semantic analysis. It involves selecting the correct sense of an ambiguous word in a given context, which can be challenging even for human annotators. Supervised machine learning models need large semantically annotated datasets to be effective, but manual sense labeling is costly, labor-intensive, and time-consuming. It is therefore important to develop and test automatic and semi-automatic methods of semantic annotation. Information about semantically related words, such as synonyms, hypernyms, and hyponyms, and about the collocations in which a word appears can be used for this purpose. This article describes our approach to generating a semantically annotated collocation corpus for Russian, including the principles used to select collocations. Our goal was to create a resource that could improve the accuracy of WSD models for Russian. To disambiguate words within collocations, we use semantically related words defined on the basis of RuWordNet; the same thesaurus serves as the source of sense inventories. The methods described yield an F1-score of 80% and make it possible to add to the corpus approximately 23% of the collocations that contain at least one ambiguous word. Automatically generated collocation corpora with semantic annotation can simplify the preparation of datasets for developing and testing WSD models, and can also serve as a valuable source of information for knowledge-based WSD models.
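
As a rough illustration of the general idea in the abstract (choosing a sense for an ambiguous word in a collocation by looking at semantically related words), here is a minimal Python sketch. It is not the authors' implementation and does not use RuWordNet. The `SENSES` dictionary, the sense identifiers, the English example words, and the overlap-based scoring are all hypothetical stand-ins chosen for readability.

```python
from typing import Dict, Optional, Set

# Hypothetical toy sense inventory: word -> {sense id -> related words}.
# In the setting described above, related words (synonyms, hypernyms,
# hyponyms) would come from a thesaurus; these entries are invented.
SENSES: Dict[str, Dict[str, Set[str]]] = {
    "bank": {
        "bank.finance": {"money", "loan", "credit", "account", "institution"},
        "bank.river": {"river", "shore", "water", "slope"},
    },
    "loan": {
        "loan.finance": {"money", "credit", "debt", "bank"},
    },
}


def related_words(word: str) -> Set[str]:
    """Return the word itself plus related words over all of its senses."""
    related = {word}
    for sense_related in SENSES.get(word, {}).values():
        related |= sense_related
    return related


def disambiguate(target: str, partner: str) -> Optional[str]:
    """Pick the sense of `target` that best matches its collocation partner.

    Each sense is scored by how many of its related words also occur among
    the partner's related words. Returns None when the target is unknown,
    when no sense overlaps, or when the best score is tied, so uncertain
    collocations are left unlabeled rather than guessed.
    """
    context = related_words(partner)
    scores = {
        sense_id: len(sense_related & context)
        for sense_id, sense_related in SENSES.get(target, {}).items()
    }
    if not scores:
        return None
    best = max(scores.values())
    winners = [sense_id for sense_id, score in scores.items() if score == best]
    if best == 0 or len(winners) != 1:
        return None
    return winners[0]
```

Calling `disambiguate("bank", "loan")` selects the financial sense in this toy inventory, while a partner with no overlapping related words yields `None`. Leaving such cases unlabeled is one simple way to trade coverage for precision in an automatically annotated corpus.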

Publisher

Aurora Group, s.r.o.

