Affiliation:
1. Indian Institute of Technology (B.H.U), Varanasi, India
Abstract
We explore and evaluate the effect of different stopword lists (non-corpus-based and corpus-based) on information retrieval (IR) tasks in four Indian languages (Bengali, Marathi, Gujarati, and Hindi) and in English. The issue was investigated from three viewpoints. (1) Is there any performance difference between non-corpus-based and corpus-based stopword removal in the chosen Indian languages? Can corpus-based stopword lists improve IR performance in Indian languages, and if so, to what extent? (2) Among the different corpus-based stopword lists, which one provides the best IR performance? (3) Does the length of a corpus-based stopword list affect retrieval performance in Indian languages, and if so, to what extent? We observed that a corpus-based stopword list provides better retrieval performance than a non-corpus-based stopword list in the Indian languages studied. Among the corpus-based stopword lists generated and tested, the Zipf's law-based (idf-based) stopword list provides the best retrieval performance in the various Indian languages. The aggregation1-based stopword list provides better retrieval than the aggregation2-based list in Indian languages, but in English the aggregation2-based stopword list performs better than the aggregation1-based list. The best-performing idf-based stopword list improves the MAP score over the corresponding baseline by 5.43% in Bengali, 1.91% in Marathi, 5.4% in Gujarati, 1.5% in Hindi, and 2.12% in English. The probabilistic retrieval models (BM25 and TF-IDF) perform best in the different Indian languages. Shorter corpus-based stopword lists perform better than longer non-corpus-based stopword lists for all the Indian languages considered.
The proposed schemes demonstrate that a stopword list can be heuristically generated by a language-independent statistical method and used effectively for IR tasks, with performance comparable to, or even better than, that of non-corpus-based stopword lists.
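To make the idea of an idf-based, corpus-derived stopword list concrete, the following is a minimal sketch and not the authors' implementation. It assumes whitespace tokenization, the plain idf formula log(N / df), and a hypothetical helper named `idf_stopwords` that returns the `k` terms with the lowest idf (the terms that occur in the most documents). The toy English corpus is illustrative only; the abstract does not specify the tokenization, the idf variant, or the cutoff used in the paper.

```python
import math
from collections import Counter


def idf_stopwords(docs, k):
    """Return the k terms with the lowest idf (hypothetical helper).

    Assumptions for this sketch: whitespace tokenization, lowercasing,
    and idf(t) = log(N / df(t)), where df(t) is the number of documents
    containing term t.
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        # Count each term at most once per document (document frequency).
        df.update(set(doc.lower().split()))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    # Lowest idf first; break ties alphabetically for a stable ordering.
    ranked = sorted(idf, key=lambda term: (idf[term], term))
    return ranked[:k]


# Toy corpus (illustrative only).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sang in the tree",
    "the tree is tall",
]

print(idf_stopwords(docs, 3))
```

Because the procedure relies only on document frequencies, the same function can in principle be applied to a corpus in any language with whitespace-delimited words, which is the sense in which such a method is language-independent.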
Funder
IIT (B.H.U), Varanasi
National Supercomputing Mission, Government of India at the IIT
Publisher
Association for Computing Machinery (ACM)