Comparative Study of Indexing and Search Strategies for the Hindi, Marathi, and Bengali Languages

Published: 2010-09 Issue: 3 Volume: 9 Page: 1-24
ISSN: 1530-0226
Container-title: ACM Transactions on Asian Language Information Processing
Language: en
Author:

Dolamic Ljiljana¹, Savoy Jacques¹

Affiliation:

1. University of Neuchatel

Abstract

The main goal of this article is to describe and evaluate various indexing and search strategies for the Hindi, Bengali, and Marathi languages. These three languages rank among the world’s 20 most spoken languages, and they share similar syntax, morphology, and writing systems. In this article we examine these languages from an Information Retrieval (IR) perspective by describing the key elements of their inflectional and derivational morphologies, and we suggest both a light and a more aggressive stemming approach based on them. In our evaluation of these stemming strategies we make use of the FIRE 2008 test collections, and then, to broaden our comparisons, we implement and evaluate two language-independent indexing methods: the n-gram and trunc-n (truncation to the first n letters). We evaluate these solutions by applying various IR models, including the Okapi, Divergence from Randomness (DFR), and statistical language models (LM), together with two classical vector-space approaches: tf-idf and Lnu-ltc. Experiments performed with all three languages demonstrate that the I(n_e)C2 model derived from the Divergence from Randomness paradigm tends to provide the best mean average precision (MAP). Our own tests suggest that improved retrieval effectiveness would be obtained by applying more aggressive stemmers, especially those accounting for certain derivational suffixes, compared to applying a light stemmer or ignoring this type of word normalization procedure. Comparisons between no-stemming and stemming indexing schemes show that performance differences are almost always statistically significant. When, for example, an aggressive stemmer is applied, the relative improvements obtained are ~28% for the Hindi language, ~42% for Marathi, and ~18% for Bengali, as compared to a no-stemming approach.
Based on a comparison of word-based and language-independent approaches, we find that the trunc-4 indexing scheme tends to result in performance levels statistically similar to those of an aggressive stemmer, yet better than the 4-gram indexing scheme. A query-by-query analysis reveals the reasons for this, and also demonstrates the advantage of applying a stemming or a trunc-4 indexing scheme.
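
The two language-independent indexing schemes named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`trunc_n`, `char_ngrams`, `index_terms`), the whitespace tokenizer, the choice to build n-grams within word boundaries, and the handling of words shorter than n are all assumptions made for illustration. Note also that Python slices strings by Unicode code point, so for Devanagari or Bengali text a combining vowel sign counts as one "letter" here, which may differ from how the article counts letters.

```python
def trunc_n(word: str, n: int = 4) -> str:
    """trunc-n: keep only the first n characters of a word."""
    return word[:n]


def char_ngrams(word: str, n: int = 4) -> list[str]:
    """Overlapping character n-grams taken inside a single word.

    Assumption: words of length <= n are kept whole as one term.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def index_terms(text: str, scheme: str = "trunc", n: int = 4) -> list[str]:
    """Turn a text into indexing terms under the chosen scheme.

    Assumption: tokens are separated by whitespace; no stopword
    removal or other normalization is applied.
    """
    tokens = text.split()
    if scheme == "trunc":
        return [trunc_n(token, n) for token in tokens]
    if scheme == "ngram":
        return [gram for token in tokens for gram in char_ngrams(token, n)]
    raise ValueError(f"unknown scheme: {scheme!r}")


if __name__ == "__main__":
    # English words are used only to keep the example easy to read.
    print(index_terms("indexing strategies", scheme="trunc"))
    print(char_ngrams("computing"))
```

In this sketch trunc-4 yields exactly one term per word, whereas the 4-gram scheme yields several overlapping terms per word, so the two schemes produce indexes of quite different size for the same text.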

Funder

Swiss National Science Foundation

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

Link

https://dl.acm.org/doi/pdf/10.1145/1838745.1838748

References: 40 articles.

1. Statistical and Comparative Evaluation of Various Indexing and Search Models

2. Probabilistic models of information retrieval based on measuring the divergence from randomness

3. Cross-Language Evaluation Forum: Objectives, Results, Achievements

Cited by 19 articles.

1. Likelihood corpus distribution: an efficient topic modelling scheme for Bengali document class identification;Sādhanā;2024-06-08

2. Building a text retrieval system for the Sanskrit language: Exploring indexing, stemming, and searching issues;Computer Speech & Language;2023-06

3. Filtering and Extended Vocabulary based Translation for Low-resource Language Pair of Sanskrit-Hindi;ACM Transactions on Asian and Low-Resource Language Information Processing;2023-04-12

4. A Survey on NLP Resources, Tools, and Techniques for Marathi Language Processing;ACM Transactions on Asian and Low-Resource Language Information Processing;2022-12-27

5. Effect of stopwords in Indian language IR;Sādhanā;2022-01-10