Collection and evaluation of lexical complexity data for Russian language using crowdsourcing-Reference-Cited by-同舟云学术

Collection and evaluation of lexical complexity data for Russian language using crowdsourcing

Published:2022-06-29 Issue:2 Volume:26 Page:409-425
ISSN:2686-8024
Container-title:Russian Journal of Linguistics
language:
Short-container-title:Russian Journal of Linguistics

Author:

Abramov Aleksei V.^ORCID,Ivanov Vladimir V.^ORCID

Abstract

Estimating word complexity with binary or continuous scores is a challenging task that has been studied for several domains and natural languages. Commonly this task is referred to as Complex Word Identification (CWI) or Lexical Complexity Prediction (LCP). Correct evaluation of word complexity can be an important step in many Lexical Simplification pipelines. Earlier works have usually presented methodologies of lexical complexity estimation with several restrictions: hand-crafted features correlated with word complexity, performed feature engineering to describe target words with features such as number of hypernyms, count of consonants, Named Entity tag, and evaluations with carefully selected target audiences. Modern works investigated the use of transforner-based models that afford extracting features from surrounding context as well. However, the majority of papers have been devoted to pipelines for the English language and few translated them to other languages such as German, French, and Spanish. In this paper we present a dataset of lexical complexity in context based on the Russian Synodal Bible collected using a crowdsourcing platform. We describe a methodology for collecting the data using a 5-point Likert scale for annotation, present descriptive statistics and compare results with analogous work for the English language. We evaluate a linear regression model as a baseline for predicting word complexity on handcrafted features, fastText and ELMo embeddings of target words. The result is a corpus consisting of 931 distinct words that used in 3,364 different contexts.

Publisher

Peoples' Friendship University of Russia

Subject

Linguistics and Language,Language and Linguistics

Reference37 articles.

1. Aprosio, Alessio P., Stefano Menini & Sara Tonelli. 2020. Adaptive complex word identification through false friend detection. In Proceedings of the 28th ACM Conference on User Modeling, Adaptation and Personalization. 192-200. https://doi.org/10.1145/3340631.3394857

2. Aroyehun, Segun Taofeek, Jason Angel, Daniel Alejandro Pérez Alvarez & Alexander Gelbukh. 2018. Complex word identification: Convolutional neural network vs. feature engineering. In Proceedings of the thirteenth workshop on innovative use of NLP for building educational applications. 322-327. https://doi.org/10.18653/v1/W18-0538

3. Blei, David M., Andrew Y. Ng & Michael I. Jordan. 2003. Latent dirichlet allocation. The Journal of Machine Learning Research 3. 993-1022.

4. Bojanowski, Piotr, Edouard Grave, Armand Joulin & Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5. 135-146.

5. Burtsev, Mikhail, Alexander Seliverstov, Rafael Airapetyan, Mikhail Arkhipov, Dilyara Baymurzina, Nickolay Bushkov, Olga Gureenkova, Taras Khakhulin, Yuri Kuratov, Denis Kuznetsov, Alexey Litinsky, Varvara Logacheva, Alexey Lymar, Valentin Malykh, Maxim Petrov, Vadim Polulyakh, Leonid Pugachev, Alexey Sorokin, Maria Vikhreva & Marat Zaynutdinov. 2018. DeepPavlov: Open-source library for dialogue systems. Proceedings of ACL 2018, System Demonstrations. 122-127. https://doi.org/10.18653/v1/P18-4021

Cited by 2 articles. 订阅此论文施引文献订阅此论文施引文献，注册后可以免费订阅5篇论文的施引文献，订阅后可以查看论文全部施引文献

1. Deep learning approaches to lexical simplification: A survey;Journal of Intelligent Information Systems;2024-09-02

2. Lexical Complexity Evaluation based on Context for Russian Language;Computación y Sistemas;2023-03-30