Author:
Acharya Prabin, Shakya Subarna
Abstract
In the Nepali language, compound word formation is mostly associated with inflection, derivation, and postposition attachment. Inflection occurs through suffixation, whereas derivation is driven by both prefixation and suffixation. Generating compound words by rules may produce many out-of-vocabulary words because of limited lexical resources and numerous exceptions. A machine learning approach can therefore help to generate valid compounds and to split them into valid morphemes, which can in turn serve as a resource for spelling suggestions, information retrieval, and machine translation. This research proposes a method to generate valid compounds from their corresponding splits (a head word plus a prefix, suffix, or postposition). A BiLSTM-based deep learning approach was used both to generate valid compound words and to split them. Publicly available Nepali Brihat Shabdakosh data from the Nepal Academy and scraped news data were used for the experiments. The results were markedly better than those of a rule-based approach applied to a similar task.
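The abstract does not say how compounds and their splits are encoded for the BiLSTM. As a minimal sketch only, one common framing for compound splitting is character-level boundary tagging: each character of the compound receives a 1 if a split follows it and a 0 otherwise. The function names `boundary_labels` and `split_by_labels` are hypothetical, and the sketch assumes the parts concatenate without any sound change at the junction, which does not hold for many real Nepali compounds.

```python
from typing import List, Tuple


def boundary_labels(parts: List[str]) -> Tuple[str, List[int]]:
    """Join non-empty parts into a compound and label split points.

    Assumption (not from the source): plain concatenation, no sandhi.
    The last character of every non-final part is labeled 1.
    """
    compound = "".join(parts)
    labels = [0] * len(compound)
    pos = 0
    for part in parts[:-1]:
        pos += len(part)
        labels[pos - 1] = 1  # a split follows this character
    return compound, labels


def split_by_labels(compound: str, labels: List[int]) -> List[str]:
    """Recover the parts from a compound and its boundary labels."""
    parts, start = [], 0
    for i, flag in enumerate(labels):
        if flag == 1:
            parts.append(compound[start:i + 1])
            start = i + 1
    parts.append(compound[start:])
    return parts


# Head word "घर" (house) plus the postposition "मा" (in).
compound, labels = boundary_labels(["घर", "मा"])
parts = split_by_labels(compound, labels)
```

Under this framing a sequence model would predict the label list from the characters of the compound, and `split_by_labels` would turn its predictions back into morphemes. Generation in the opposite direction, and compounds whose junction characters change, would need a sequence-to-sequence formulation instead.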
Publisher
Inventive Research Organization
References: 8 articles.
[1] Dave, Sushant, Arun Kumar Singh, Dr Prathosh AP, and Prof Brejesh Lall. "Neural compound-word (Sandhi) generation and splitting in Sanskrit language." In 8th ACM IKDD CODS and 26th COMAD, pp. 171-177. 2021.
[2] Aralikatte, Rahul, Neelamadhav Gantayat, Naveen Panwar, Anush Sankaran, and Senthil Mani. "Sanskrit Sandhi Splitting using seq2(seq)^2." arXiv preprint arXiv:1801.00428 (2018).
[3] Daðason, Jón Friðrik, David Erik Mollberg, Hrafn Loftsson, and Kristín Bjarnadóttir. "Kvistur 2.0: a BiLSTM Compound Splitter for Icelandic." arXiv preprint arXiv:2004.07776 (2020).
[4] Hellwig, Oliver. "Using Recurrent Neural Networks for joint compound splitting and Sandhi resolution in Sanskrit." In 4th Biennial Workshop on Less-Resourced Languages. 2015.
[5] Premjith, B., Chandni Chandran, Shriganesh Bhat, Soman Kp, and P. Prabaharan. "A machine learning approach for identifying compound words from a Sanskrit text." In Proceedings of the 6th International Sanskrit Computational Linguistics Symposium, pp. 45-51. 2019.