A novel feature selection technique for enhancing performance of unbalanced text classification problem

Published:2022-04-18 Issue:1 Volume:16 Page:51-69
ISSN:1872-4981
Container-title:Intelligent Decision Technologies
Short-container-title:IDT

Author:

Behera Santosh Kumar, Dash Rajashree

Abstract

Over the last few decades, Text Classification (TC) has become an important research direction because of the huge volume of digital text documents available on the web. Organizing and labeling these documents manually would be tedious for human experts. Moreover, the large number of highly sparse terms and the skewed categories present in the documents make it challenging to label unlabeled documents correctly. Feature selection is therefore an essential step in text classification: it aims to select a concise set of relevant features for further mining of the documents. The task becomes harder still when the texts are associated with multiple categories and the class distribution of the dataset is unbalanced. In this paper, a Modified Chi-Square (ModCHI) based feature selection technique is proposed to improve the classification of multi-labeled text documents with unbalanced class distributions. It is an improved version of the Chi-square (Chi) method that emphasizes selecting the maximum number of features from the classes with a large number of training and testing documents. Unlike Chi, in which the features with the highest Chi values are selected, the proposed technique calculates a score for each class from the number of relevant documents in that class relative to the total number of documents in the original dataset. According to this score, features that belong to the highly relevant classes and also have a high Chi-square value are selected for further processing. The proposed technique is verified with four classifiers, namely Linear SVM (LSVM), Decision Tree (DT), Multilevel KNN (MLKNN) and Random Forest (RF), on the Reuters benchmark dataset, which is multi-labeled, multi-class and unbalanced.
The effectiveness of the model is also tested by comparing it with three traditional feature selection techniques: term frequency-inverse document frequency (TF-IDF), Chi-square, and Mutual Information (MI). The experiments show that LSVM with ModCHI produces the highest precision of 0.94, recall of 0.80 and F-measure of 0.86, and the lowest Hamming loss of 0.003, with a feature size of 1000. Compared with the TF-IDF, Chi and MI techniques respectively, the proposed feature selection technique with LSVM improves the average precision by 3.33%, 2.19% and 16.25%, the average recall by 3.03%, 33.33% and 21.42%, the average F-measure by 4%, 34.48% and 14.70%, and the average Hamming loss by 14%, 37.68% and 31.74%. These findings indicate that the proposed feature selection technique performs better than the TF-IDF, Chi and MI techniques on the unbalanced Reuters dataset.
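
The abstract does not state the ModCHI formula, so the following is only a rough sketch of the idea it describes, not the authors' method. It assumes that the class score is the share of documents carrying each label, and that the feature budget is split across classes in proportion to that score, with the highest Chi-square terms taken within each class. The function names `chi_square_scores` and `modchi_like_select`, the quota rule, and the toy data are all hypothetical.

```python
import numpy as np


def chi_square_scores(X, Y):
    """Chi-square statistic of every term for every class.

    X: (n_docs, n_terms) binary term-presence matrix.
    Y: (n_docs, n_classes) binary label-indicator matrix (multi-label).
    Returns an (n_classes, n_terms) array.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n = X.shape[0]
    a = Y.T @ X                        # term present, document in class
    b = X.sum(axis=0)[None, :] - a     # term present, document not in class
    c = Y.sum(axis=0)[:, None] - a     # term absent, document in class
    d = n - a - b - c                  # term absent, document not in class
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    # Terms or classes with an empty margin get a score of zero.
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)


def modchi_like_select(X, Y, k):
    """Pick k term indices, favouring classes with more documents.

    Assumption (not from the paper): each class receives a quota of
    features proportional to its document count, filled with that
    class's highest Chi-square terms. Assumes k <= n_terms and at
    least one labelled document.
    """
    Y = np.asarray(Y, dtype=int)
    chi = chi_square_scores(X, Y)
    counts = Y.sum(axis=0)                   # relevant documents per class
    quota = (k * counts) // counts.sum()     # integer share of the budget
    selected = []
    # Visit classes from most to least populated.
    for cls in np.argsort(-counts, kind="stable"):
        ranked = [int(t) for t in np.argsort(-chi[cls], kind="stable")
                  if int(t) not in selected]
        selected.extend(ranked[:quota[cls]])
    # Fill slots lost to rounding with the best remaining terms overall.
    best = chi.max(axis=0)
    for t in np.argsort(-best, kind="stable"):
        if len(selected) >= k:
            break
        if int(t) not in selected:
            selected.append(int(t))
    return selected[:k]


# Toy data: 6 documents, 4 terms, 2 classes (4 documents vs 2 documents).
X_demo = np.array([[1, 0, 0, 1],
                   [1, 0, 0, 0],
                   [1, 1, 0, 0],
                   [1, 0, 0, 1],
                   [0, 0, 1, 1],
                   [0, 1, 1, 0]])
Y_demo = np.array([[1, 0], [1, 0], [1, 0], [1, 0], [0, 1], [0, 1]])

print(modchi_like_select(X_demo, Y_demo, k=3))
```

In this sketch the larger class receives two of the three feature slots and the smaller class one, which mirrors the abstract's emphasis on classes with more documents; the paper itself may weight or combine the score and the Chi-square value differently.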

Publisher

IOS Press

Subject

Artificial Intelligence, Computer Vision and Pattern Recognition, Human-Computer Interaction, Software

Cited by 1 article.

1. Research on the classification of winding machine faults based on the ETL model structure;2023 2nd International Conference on Robotics, Artificial Intelligence and Intelligent Control (RAIIC);2023-08-11