Affiliation:
1. Technicolor, France
2. University of Kurdistan, Iran
3. Nanyang Technological University, Singapore
Abstract
Kurdish is an Indo-European language spoken in Kurdistan, a large geographical region in the Middle East. Despite its large number of speakers, Kurdish is among the less-resourced languages and has received little attention from the information retrieval (IR) and natural language processing (NLP) research communities. This article reports on the outcomes of a project aimed at providing essential resources for processing Kurdish texts.
A principal output of this project is Pewan, the first standard test collection for evaluating Kurdish information retrieval systems. The other language resources that we have built include a lightweight stemmer and a list of stopwords.
Our second principal contribution is the use of these newly built resources to conduct a thorough experimental study on Kurdish documents. Our experimental results show that normalization and, to a lesser extent, stemming can greatly improve the performance of Kurdish IR systems.
Publisher
Association for Computing Machinery (ACM)
Cited by
6 articles.
1. A Systematic Review of Stemmers of Indian and Non-Indian Vernacular Languages;ACM Transactions on Asian and Low-Resource Language Information Processing;2024-01-15
2. CURE: Collection for Urdu Information Retrieval Evaluation and Ranking;2021 International Conference on Digital Futures and Transformative Technologies (ICoDT2);2021-05-20
3. Text Preprocessing for Text Mining in Organizational Research: Review and Recommendations;Organizational Research Methods;2020-11-23
4. Next word prediction based on the N-gram model for Kurdish Sorani and Kurmanji;Neural Computing and Applications;2020-08-11
5. A Rule-Based Kurdish Text Transliteration System;ACM Transactions on Asian and Low-Resource Language Information Processing;2019-06-30