Affiliation:
1. Tezpur University
2. University of Colorado, Colorado Springs
Abstract
Stemming is a basic method for morphological normalization of natural language texts. In this study, we focus on the problem of stemming several resource-poor languages from Eastern India, viz., Assamese, Bengali, Bishnupriya Manipuri and Bodo. While Assamese, Bengali and Bishnupriya Manipuri are Indo-Aryan, Bodo is a Tibeto-Burman language. We design a rule-based approach to remove suffixes from words. To reduce over-stemming and under-stemming errors, we introduce a dictionary of frequent words. We observe that, for these languages a dominant amount of suffixes are single letters creating problems during suffix stripping. As a result, we introduce an HMM-based hybrid approach to classify the mis-matched last character. For each word, the stem is extracted by calculating the most probable path in four HMM states. At each step we measure the stemming accuracy for each language. We obtain 94% accuracy for Assamese and Bengali and 87%, and 82% for Bishnupriya Manipuri and Bodo, respectively, using the hybrid approach. We compare our work with Morfessor [Creutz and Lagus 2005]. As of now, there is no reported work on stemming for Bishnupriya Manipuri and Bodo. Our results on Assamese and Bengali show significant improvement over prior published work [Sarkar and Bandyopadhyay 2008; Sharma et al. 2002, 2003].
Publisher
Association for Computing Machinery (ACM)
Reference54 articles.
1. Towards an error-free Arabic stemming
2. L. S. Bora. 2006. Asamiya Bhasar Ruptattva. M/s Banalata Guwahati Assam India. L. S. Bora. 2006. Asamiya Bhasar Ruptattva . M/s Banalata Guwahati Assam India.
Cited by
9 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献