Affiliation:
1. National University of Defense Technology
Abstract
The information explosion poses many challenges for text classification. The curse of dimensionality leads to a sharp increase in computational complexity and lower classification accuracy, so it is critical to apply feature selection techniques before the actual classification step. Automatic classification of English text has been studied for many years, but Chinese text has received far less attention. In this paper, several classic feature selection methods, namely term frequency (TF), information gain (IG), and the chi-square statistic (CHI), are compared on Chinese text classification, with imbalanced data taken into consideration. Experimental results show that CHI performs better than IG and TF when the dataset is imbalanced, while no obvious difference appears on balanced data.
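To make the comparison concrete, the CHI statistic mentioned in the abstract scores a term t against a class c from a 2x2 contingency table of document counts. The sketch below is a minimal, self-contained illustration (the toy corpus and class labels are hypothetical, not from the paper):

```python
def chi_square(a, b, c, d):
    """Chi-square score of a term for a class from a 2x2 table:
    a: docs in class containing the term
    b: docs outside class containing the term
    c: docs in class lacking the term
    d: docs outside class lacking the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

# Hypothetical toy corpus: (token set, class label) pairs
docs = [
    ({"stocks", "market"}, "finance"),
    ({"stocks", "bank"},   "finance"),
    ({"match", "goal"},    "sports"),
    ({"goal", "market"},   "sports"),
]

def score_term(term, cls, corpus):
    """Build the contingency table for (term, cls) and return its CHI score."""
    a = sum(1 for toks, y in corpus if term in toks and y == cls)
    b = sum(1 for toks, y in corpus if term in toks and y != cls)
    c = sum(1 for toks, y in corpus if term not in toks and y == cls)
    d = sum(1 for toks, y in corpus if term not in toks and y != cls)
    return chi_square(a, b, c, d)
```

A term that appears only in one class (e.g. "stocks" for "finance" above) receives a high score, while a term spread evenly across classes scores zero; ranking terms by this score and keeping the top-k is the usual feature selection step before training a classifier.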
Publisher
Trans Tech Publications, Ltd.