Generation and Splitting of the Compound Words in Nepali Text

Authors:

Acharya Prabin, Shakya Subarna

Abstract

In the Nepali language, compound word formation is mostly associated with inflection, derivation, and postposition attachment. Inflection occurs through suffixation, whereas derivation is driven by both prefixation and suffixation. Generating compound words by rule may produce many out-of-vocabulary words, because lexical resources are limited and exceptions are numerous. A machine learning approach can therefore help to generate valid compounds and split them into valid morphemes, which can then serve as a resource for spelling suggestion, information retrieval, and machine translation. This research proposes a method to generate valid compounds from the corresponding compound splits (a head word and its prefix, suffix, or postposition). A BiLSTM-based deep learning approach was used to generate and split valid compound words. Publicly available Nepali Brihat Shabdakosh data from the Nepal Academy and scraped news data were used for the experiments. The results obtained were markedly better than those of a rule-based approach applied to a similar task.
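To make the splitting task concrete, the sketch below shows one common way to frame it as character-level sequence labeling, the kind of input/output pairing a BiLSTM tagger could be trained on. The abstract does not state the exact formulation used, so the labeling scheme, the function names `split_to_labels` and `labels_to_split`, and the example word are illustrative assumptions, not the authors' method. The scheme also assumes the compound is the exact concatenation of its parts; compounds whose spelling changes at the boundary would need a sequence-to-sequence formulation instead.

```python
from typing import List, Tuple


def split_to_labels(parts: List[str]) -> Tuple[str, List[int]]:
    """Build a training pair from a known split (hypothetical helper).

    Returns the concatenated compound and one label per Unicode code point:
    1 marks the first code point of every non-initial morpheme, 0 otherwise.
    """
    compound = "".join(parts)
    labels: List[int] = []
    for part_index, part in enumerate(parts):
        for char_index, _ in enumerate(part):
            is_boundary = part_index > 0 and char_index == 0
            labels.append(1 if is_boundary else 0)
    return compound, labels


def labels_to_split(compound: str, labels: List[int]) -> List[str]:
    """Recover the morphemes from per-code-point boundary labels.

    In a real system the labels would come from a trained tagger;
    here they are supplied directly.
    """
    if len(compound) != len(labels):
        raise ValueError("compound and labels must have the same length")
    parts: List[str] = []
    current = ""
    for char, label in zip(compound, labels):
        if label == 1 and current:
            parts.append(current)
            current = ""
        current += char
    if current:
        parts.append(current)
    return parts


if __name__ == "__main__":
    # Illustrative example: head word "घर" (house) + postposition "मा" (in).
    compound, labels = split_to_labels(["घर", "मा"])
    print(compound, labels)
    print(labels_to_split(compound, labels))
```

Labels are assigned per code point rather than per visible syllable, so a dependent vowel sign such as "ा" gets its own label. That keeps the encoding simple, at the cost of allowing a model to predict a boundary in front of a combining mark, which a practical system would need to rule out.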

Publisher

Inventive Research Organization

Subject

General Medicine
