Adapting Feature Selection Algorithms for the Classification of Chinese Texts

Authors:

Liu Xuan 1, Wang Shuang 2, Lu Siyu 2, Yin Zhengtong 3, Li Xiaolu 4, Yin Lirong 5, Tian Jiawei 2, Zheng Wenfeng 2

Affiliation:

1. School of Public Affairs and Administration, University of Electronic Science and Technology of China, Chengdu 611731, China

2. School of Automation, University of Electronic Science and Technology of China, Chengdu 610054, China

3. College of Resource and Environment Engineering, Guizhou University, Guiyang 550025, China

4. School of Geographic Science, Southwest University, Chongqing 400715, China

5. Department of Geography and Anthropology, Louisiana State University, Baton Rouge, LA 70803, USA

Abstract

Text classification has been highlighted as a key process for organizing online texts for better communication in the Digital Media Age. Because text classification establishes classification rules based on text features, the accuracy of feature selection is the basis of text classification. Faced with the rapid growth of Chinese electronic documents in the digital environment, scholars have in recent years accumulated quite a few feature selection algorithms for the automatic classification of Chinese texts. However, discussion of how to adapt existing feature selection algorithms to various types of Chinese texts is still inadequate. To address this, this study proposes three improved feature selection algorithms and tests their performance on different types of Chinese texts. These include an enhanced CHI square with mutual information (MI) algorithm, which simultaneously introduces word frequency and term adjustment (CHMI); a term frequency–CHI square (TF–CHI) algorithm, which enhances weight calculation; and a term frequency–inverse document frequency (TF–IDF) algorithm enhanced with the extreme gradient boosting (XGBoost) algorithm, which improves word filtering (TF–XGBoost). This study randomly chooses 3000 texts from six different categories of the Sogou news corpus to obtain the confusion matrix and evaluates the performance of the new algorithms with precision and the F1-score. Experimental comparisons are conducted on support vector machine (SVM) and naive Bayes (NB) classifiers. The experimental results demonstrate that the feature selection algorithms proposed in this paper improve performance across various news corpora, although the best feature selection scheme differs for each type of corpus. Further studies on applying the improved feature selection methods to other languages and on improving the classifiers are suggested.
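The abstract does not give formulas for the improved algorithms, so the following is only a minimal sketch of the two textbook baseline statistics they build on: the CHI square score and the (pointwise) mutual information between a term and a category, both computed from document counts. It is not the CHMI, TF–CHI, or TF–XGBoost method itself. The function names, the toy documents, and the assumption that texts are already segmented into word sets are all illustrative.

```python
import math


def term_category_counts(docs, labels, term, category):
    """Count documents in the 2x2 term/category contingency table.

    a: in category, contains term      b: not in category, contains term
    c: in category, lacks term         d: not in category, lacks term
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_category = label == category
        if has_term and in_category:
            a += 1
        elif has_term:
            b += 1
        elif in_category:
            c += 1
        else:
            d += 1
    return a, b, c, d


def chi_square(a, b, c, d):
    """Standard CHI square statistic for a term and a category."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # term or category is degenerate; no evidence either way
    return n * (a * d - b * c) ** 2 / denom


def mutual_information(a, b, c, d):
    """Pointwise mutual information log(P(t, c) / (P(t) * P(c)))."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")  # term never co-occurs with the category
    return math.log(a * n / ((a + c) * (a + b)))


# Hypothetical, already-segmented toy corpus (English tokens for readability).
docs = [
    {"match", "team", "goal"},
    {"team", "coach", "season"},
    {"stock", "market", "bank"},
    {"bank", "loan", "market"},
]
labels = ["sports", "sports", "finance", "finance"]

for term in ("team", "market"):
    counts = term_category_counts(docs, labels, term, "sports")
    print(term, counts, chi_square(*counts), mutual_information(*counts))
```

In this toy corpus "team" and "market" receive the same CHI square score for the sports category, even though one appears only inside the category and the other only outside it, while their mutual information values differ. Combining the two statistics, as the CHMI name suggests, is one plausible way to use both signals, but the exact combination used in the paper is not stated here.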

Funder

Sichuan Science and Technology Program

Sichuan Social Science Major Project

Publisher

MDPI AG

Subject

Information Systems and Management, Computer Networks and Communications, Modeling and Simulation, Control and Systems Engineering, Software


Cited by 110 articles.
