Abstract
In a standard text classification (TC) study, preprocessing is one of the key components for improving performance. This study examines how preprocessing affects TC with respect to news text, text language, and feature selection. To this end, all possible combinations of commonly used preprocessing techniques are compared on one domain, news data, using two different news datasets. This makes it possible to evaluate the contribution of each preprocessing technique to classification performance at multiple feature sizes, possible interactions among the techniques, and the dependency of each technique on the language of the text. Experimental studies on public datasets reveal that choosing the best combination of preprocessing techniques, rather than applying all of them or none of them, can improve classification accuracy significantly.
Publisher
Sakarya University Journal of Computer and Information Sciences
Cited by
1 article.
1. The Effect of Document Length on Machine Learning Success in Text-Based Data. 2023 Innovations in Intelligent Systems and Applications Conference (ASYU), 2023-10-11.