Stemming resource-poor Indian languages

Published:2014-10-03 Issue:3 Volume:13 Page:1-26
ISSN:1530-0226
Container-title:ACM Transactions on Asian Language Information Processing
language:en

Author:

Saharia Navanath¹, Sharma Utpal¹, Kalita Jugal²

Affiliation:

1. Tezpur University

2. University of Colorado, Colorado Springs

Abstract

Stemming is a basic method for morphological normalization of natural language texts. In this study, we focus on the problem of stemming several resource-poor languages from Eastern India, viz., Assamese, Bengali, Bishnupriya Manipuri and Bodo. While Assamese, Bengali and Bishnupriya Manipuri are Indo-Aryan, Bodo is a Tibeto-Burman language. We design a rule-based approach to remove suffixes from words. To reduce over-stemming and under-stemming errors, we introduce a dictionary of frequent words. We observe that, for these languages, a large proportion of suffixes are single letters, which creates problems during suffix stripping. As a result, we introduce an HMM-based hybrid approach to classify the mismatched last character. For each word, the stem is extracted by calculating the most probable path through four HMM states. At each step we measure the stemming accuracy for each language. Using the hybrid approach, we obtain 94% accuracy for Assamese and Bengali, and 87% and 82% for Bishnupriya Manipuri and Bodo, respectively. We compare our work with Morfessor [Creutz and Lagus 2005]. As of now, there is no reported work on stemming for Bishnupriya Manipuri and Bodo. Our results on Assamese and Bengali show significant improvement over prior published work [Sarkar and Bandyopadhyay 2008; Sharma et al. 2002, 2003].
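
The rule-based step described in the abstract (strip a suffix unless the word appears in a dictionary of frequent words) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function name `strip_suffix`, the `min_stem_len` guard, the longest-suffix-first ordering, and the toy romanized suffixes and words. The HMM-based step is not shown.

```python
# Minimal sketch of dictionary-guarded suffix stripping.
# All names and data here are hypothetical, not the authors' rules or lexicon.

SUFFIXES = ["bor", "r", "k"]   # toy romanized suffixes, for illustration only
FREQUENT_WORDS = {"ghor"}      # toy dictionary of words to leave unchanged


def strip_suffix(word, suffixes, frequent_words, min_stem_len=2):
    """Return a candidate stem for `word`.

    Words found in `frequent_words` are returned as-is, which is one way
    a dictionary can prevent over-stemming by single-letter suffixes.
    """
    if word in frequent_words:
        return word
    # Try longer suffixes first so that "bor" wins over "r".
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for w in ["kitapbor", "ghor", "ar"]:
        print(w, "->", strip_suffix(w, SUFFIXES, FREQUENT_WORDS))
```

Without the dictionary entry, the toy word "ghor" would lose its final "r", which mirrors the single-letter suffix problem the abstract mentions.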

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

Link

https://dl.acm.org/doi/pdf/10.1145/2629670

Reference: 54 articles.

1. Towards an error-free Arabic stemming

2. L. S. Bora. 2006. Asamiya Bhasar Ruptattva. M/s Banalata, Guwahati, Assam, India.

Cited by 9 articles.

1. SUSTEM: An Improved Rule-based Sundanese Stemmer;ACM Transactions on Asian and Low-Resource Language Information Processing;2024-06-21

2. An Analytical Analysis of Text Stemming Methodologies in Information Retrieval and Natural Language Processing Systems;IEEE Access;2023

3. Designing Stemmer for Afaraf Text Using Rule Based Approach;Innovations in Computer Science and Engineering;2022

4. Improving stemming for Assamese information retrieval;International Journal of Information Technology;2021-07-10

5. Design and Development of Unsupervised Stemmer for Sindhi Language;Procedia Computer Science;2020