Stemming resource-poor Indian languages

Author:

Saharia Navanath¹, Sharma Utpal¹, Kalita Jugal²

Affiliation:

1. Tezpur University

2. University of Colorado, Colorado Springs

Abstract

Stemming is a basic method for morphological normalization of natural language texts. In this study, we focus on the problem of stemming several resource-poor languages from Eastern India, viz., Assamese, Bengali, Bishnupriya Manipuri and Bodo. While Assamese, Bengali and Bishnupriya Manipuri are Indo-Aryan, Bodo is a Tibeto-Burman language. We design a rule-based approach to remove suffixes from words. To reduce over-stemming and under-stemming errors, we introduce a dictionary of frequent words. We observe that, for these languages, a large proportion of suffixes are single letters, which creates problems during suffix stripping. To address this, we introduce an HMM-based hybrid approach to classify the mismatched last character. For each word, the stem is extracted by calculating the most probable path through four HMM states. At each step we measure the stemming accuracy for each language. Using the hybrid approach, we obtain 94% accuracy for Assamese and Bengali, and 87% and 82% for Bishnupriya Manipuri and Bodo, respectively. We compare our work with Morfessor [Creutz and Lagus 2005]. As of now, there is no reported work on stemming for Bishnupriya Manipuri and Bodo. Our results on Assamese and Bengali show significant improvement over prior published work [Sarkar and Bandyopadhyay 2008; Sharma et al. 2002, 2003].
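The rule-based step with a frequent-word dictionary can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not list the actual suffix rules or dictionary entries, so the suffix list, the frequent-word set, and the minimum stem length below are hypothetical English stand-ins chosen only to show the control flow (dictionary lookup first, then longest-suffix stripping).

```python
# Hypothetical stand-ins: the paper's real suffix rules and frequent-word
# dictionary are not given in the abstract, so English examples are used.
SUFFIXES = ["ness", "ing", "ed", "s"]
FREQUENT_WORDS = {"news", "bed", "king"}
MIN_STEM_LEN = 2  # assumed guard against stripping a word down to nothing


def stem(word: str) -> str:
    """Return a stem for `word` using dictionary lookup, then suffix stripping."""
    # Frequent words are returned unchanged, which avoids over-stemming
    # words that merely look suffixed (e.g. "news" is not "new" + "s").
    if word in FREQUENT_WORDS:
        return word
    # Try longer suffixes first so "ness" wins over the single letter "s".
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    # No rule applied: the word is treated as its own stem.
    return word
```

In this sketch the single-letter suffix "s" is the risky one, since many unrelated words end in that letter. This is the kind of ambiguity the abstract says motivates the HMM-based classification of the last character, which is not modeled here.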

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

References: 54 articles.

1. Towards an error-free Arabic stemming

2. L. S. Bora. 2006. Asamiya Bhasar Ruptattva. M/s Banalata, Guwahati, Assam, India.

Cited by 9 articles.

1. SUSTEM: An Improved Rule-based Sundanese Stemmer;ACM Transactions on Asian and Low-Resource Language Information Processing;2024-06-21

2. An Analytical Analysis of Text Stemming Methodologies in Information Retrieval and Natural Language Processing Systems;IEEE Access;2023

3. Designing Stemmer for Afaraf Text Using Rule Based Approach;Innovations in Computer Science and Engineering;2022

4. Improving stemming for Assamese information retrieval;International Journal of Information Technology;2021-07-10

5. Design and Development of Unsupervised Stemmer for Sindhi Language;Procedia Computer Science;2020
