ReaderBench: Multilevel analysis of Russian text characteristics

Published:2022-06-29 Issue:2 Volume:26 Page:342-370
ISSN:2686-8024
Container-title:Russian Journal of Linguistics
Short-container-title:Russian Journal of Linguistics

Author:

Corlatescu Dragos, Ruseti Ștefan, Dascalu Mihai

Abstract

This paper introduces an adaptation of the open-source ReaderBench framework that now supports multilevel analyses of Russian text characteristics, integrating both textual complexity indices and state-of-the-art language models, namely Bidirectional Encoder Representations from Transformers (BERT). The proposed processing pipeline was evaluated on a dataset containing Russian texts from two language levels for foreign learners (A - Basic user and B - Independent user). Our experiments showed that the ReaderBench complexity indices are statistically significant in differentiating between the two classes of language level, from both: a) a statistical perspective, where a Kruskal-Wallis analysis was performed and features such as the “nmod” dependency tag or the number of nouns at the sentence level proved to be the most predictive; and b) a neural network perspective, where our model combining textual complexity indices and contextualized embeddings obtained an accuracy of 92.36% in a leave-one-text-out cross-validation, outperforming the BERT baseline. ReaderBench can be employed by designers and developers of educational materials to evaluate and rank materials based on their difficulty, as well as by a larger audience for assessing text complexity in different domains, including law, science, or politics.
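The Kruskal-Wallis comparison mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `kruskal_wallis_two_groups` and the sample values are hypothetical, and the sketch only shows how a single complexity index (for example, nouns per sentence) might be compared between level A and level B texts. With two groups, the H statistic is approximately chi-square distributed with one degree of freedom, so the p-value can be computed with the standard library alone.

```python
import math
from collections import Counter


def kruskal_wallis_two_groups(a, b):
    """Return (H, p) for a Kruskal-Wallis test on two samples, with tie correction.

    Assumes the pooled data contain at least two distinct values.
    """
    pooled = sorted(a + b)
    n = len(pooled)

    # Assign each distinct value the mean of the rank positions it occupies.
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j

    # H = 12 / (N (N + 1)) * sum(R_g^2 / n_g) - 3 (N + 1)
    total = 0.0
    for group in (a, b):
        rank_sum = sum(rank_of[x] for x in group)
        total += rank_sum ** 2 / len(group)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)

    # Tie correction: divide by 1 - sum(t^3 - t) / (N^3 - N).
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    h /= 1 - ties / (n ** 3 - n)
    h = max(h, 0.0)  # guard against tiny negative rounding errors

    # Chi-square survival function for 1 degree of freedom.
    p = math.erfc(math.sqrt(h / 2))
    return h, p


# Hypothetical per-text values of one index; not data from the paper.
level_a = [2.1, 2.4, 2.0, 2.6, 2.3]
level_b = [3.0, 3.4, 2.9, 3.8, 3.1]
h_stat, p_value = kruskal_wallis_two_groups(level_a, level_b)
print(f"H = {h_stat:.3f}, p = {p_value:.4f}")
```

In practice a library routine such as `scipy.stats.kruskal` would normally be used instead, and the test would be repeated for every index with a correction for multiple comparisons.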

Publisher

Peoples' Friendship University of Russia

Subject

Linguistics and Language, Language and Linguistics

Cited by 1 article.

1. Parametric Taxonomy of Educational Texts; Vestnik Volgogradskogo gosudarstvennogo universiteta. Serija 2. Jazykoznanije; 2024-02