Abstract
The concept of stop words, introduced by H. P. Luhn in the mid-20th century, plays an important role in today's NLP practice. Stop words are removed to reduce noise in text data, discard uninformative words, speed up text processing, and cut the memory needed to store data. Kyrgyz is an agglutinative Turkic language for which no scientific study of stop words has previously been published in English. In our study, we combined frequency analysis with rule-based linguistic analysis. First, we counted word frequencies, set a frequency threshold, and discarded the words that fell below it, which left a list of the most frequently used words. We then reduced this list by excluding every word that does not belong to the category of function words in Kyrgyz. The result is a list of 50 words that can be considered stop words in the Kyrgyz language. Our analysis used a single corpus of sentences collected and published as an open-source project by one of the local broadcasters.
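The two-stage procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `candidate_stop_words`, the `min_count` threshold, the regex tokenizer, and the toy sentences are all assumptions, and the function-word set stands in for the hand-curated linguistic filter, which the abstract does not spell out.

```python
import re
from collections import Counter


def candidate_stop_words(sentences, min_count, function_words):
    """Frequency filter followed by a function-word filter (hypothetical sketch)."""
    tokens = []
    for sentence in sentences:
        # Lowercase and keep runs of Unicode letters, so Cyrillic words survive.
        tokens.extend(re.findall(r"[^\W\d_]+", sentence.lower()))

    counts = Counter(tokens)

    # Stage 1: keep only words at or above the frequency threshold (assumed value).
    frequent = [word for word, count in counts.most_common() if count >= min_count]

    # Stage 2: keep only words that the linguistic analysis marks as function words.
    return [word for word in frequent if word in function_words]


# Toy input, not taken from the corpus used in the study.
toy_sentences = [
    "алма жана өрүк",
    "нан жана суу",
    "суу менен нан",
    "алма менен суу",
]
toy_function_words = {"жана", "менен", "үчүн"}

result = candidate_stop_words(toy_sentences, min_count=2, function_words=toy_function_words)
print(result)
```

In this toy run the content words that pass the frequency threshold are dropped by the second stage, which is why the frequency filter alone is not enough to produce a stop word list.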
Subject
Linguistics and Language, Language and Linguistics