Simple or Complex? Learning to Predict Readability of Bengali Texts

Author:

Chakraborty Susmoy, Nayeem Mir Tafseer, Ahmad Wasi Uddin

Abstract

Determining the readability of a text is the first step toward its simplification. In this paper, we present a readability analysis tool that analyzes text written in the Bengali language and provides in-depth information on its readability and complexity. Despite being the 7th most spoken language in the world, with 230 million native speakers, Bengali suffers from a lack of fundamental resources for natural language processing. Because of this lack of resources, readability research on Bengali has so far been narrow and sometimes faulty. Therefore, we adapt document-level readability formulas traditionally used in the U.S. education system to the Bengali language with a proper age-to-age comparison. Due to the unavailability of large-scale human-annotated corpora, we further divide the document-level task into a sentence-level task and experiment with neural architectures, which will serve as baselines for future work on Bengali readability prediction. In the process, we present several human-annotated corpora and dictionaries: a document-level dataset comprising 618 documents with 12 different grade levels; a large-scale sentence-level dataset comprising more than 96K sentences with simple and complex labels; a consonant conjunct count algorithm and a corpus of 341 words to validate the effectiveness of the algorithm; a list of 3,396 easy words; and an updated pronunciation dictionary with more than 67K words. These resources can be useful for several other tasks in this low-resource language.

Publisher

Association for the Advancement of Artificial Intelligence (AAAI)

Subject

General Medicine

Cited by 8 articles.

1. Analisis Keterbacaan Teks Buku Ajar Bahasa Indonesia SMP Kelas 9 Menggunakan Formula Grafik Fry;Pubmedia Jurnal Penelitian Tindakan Kelas Indonesia;2024-05-17

2. Multisensory computer-based system for teaching sentence reading in Hindi and Bangla to children with dyslexia;Technology and Disability;2023-12-27

3. A Machine Learning-Based Readability Model for Gujarati Texts;ACM Transactions on Asian and Low-Resource Language Information Processing;2023-12-21

4. Navigating Bengali Linguistics: Insights from Machine and Deep Learning Perspectives for Categorization of Sentences;2023 26th International Conference on Computer and Information Technology (ICCIT);2023-12-13

5. Beyond Words: Unraveling Text Complexity with Novel Dataset and A Classifier Application;2023 26th International Conference on Computer and Information Technology (ICCIT);2023-12-13
