Feature selection for text categorization on imbalanced data

Published:2004-06 Issue:1 Volume:6 Page:80-89
ISSN:1931-0145
Container-title:ACM SIGKDD Explorations Newsletter
language:en
Short-container-title:SIGKDD Explor. Newsl.

Author:

Zheng Zhaohui¹, Wu Xiaoyun¹, Srihari Rohini¹

Affiliation:

1. University at Buffalo, Amherst, NY

Abstract

A number of feature selection metrics have been explored in text categorization, among which information gain (IG), chi-square (CHI), correlation coefficient (CC) and odds ratio (OR) are considered most effective. CC and OR are one-sided metrics, while IG and CHI are two-sided. Feature selection using one-sided metrics selects only the features most indicative of membership, while feature selection using two-sided metrics implicitly combines the features most indicative of membership (i.e., positive features) and of non-membership (i.e., negative features) by ignoring the signs of features. The former never considers the negative features, which are quite valuable, while the latter cannot ensure the optimal combination of the two kinds of features, especially on imbalanced data. In this work, we investigate the usefulness of explicit control of that combination within a proposed feature selection framework. Using multinomial naïve Bayes and regularized logistic regression as classifiers, our experiments show both the great potential and the actual merits of explicitly combining positive and negative features in a nearly optimal fashion suited to the imbalanced data.
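The distinction between one-sided and two-sided metrics can be illustrated with a minimal Python sketch. It is not the authors' framework. It assumes the standard 2x2 contingency-table definitions, in which CC is the signed square root of CHI, and all function names, the `positive_ratio` parameter, and the toy counts are hypothetical.

```python
import math


def signed_scores(a, b, c, d):
    """Return (CC, CHI) for one term and one category.

    a: in-category documents containing the term
    b: out-of-category documents containing the term
    c: in-category documents lacking the term
    d: out-of-category documents lacking the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # The term occurs in every document or in none: it carries no signal.
        return 0.0, 0.0
    cc = math.sqrt(n) * (a * d - c * b) / math.sqrt(denom)
    return cc, cc * cc  # CHI is CC squared, so CHI discards the sign


def select_two_sided(counts, size):
    """Top `size` terms by CHI; positive and negative terms compete together."""
    chi = {t: signed_scores(*v)[1] for t, v in counts.items()}
    return sorted(chi, key=chi.get, reverse=True)[:size]


def select_explicit(counts, size, positive_ratio):
    """Pick positive and negative terms separately (assumed scheme).

    `positive_ratio` is a hypothetical knob: the fraction of the `size`
    slots reserved for positive features; the rest go to negative ones.
    """
    cc = {t: signed_scores(*v)[0] for t, v in counts.items()}
    n_pos = int(round(size * positive_ratio))
    n_neg = size - n_pos
    pos = sorted((t for t in cc if cc[t] > 0), key=cc.get, reverse=True)
    neg = sorted((t for t in cc if cc[t] < 0), key=cc.get)
    return pos[:n_pos], neg[:n_neg]


# Toy imbalanced corpus: 10 in-category and 90 out-of-category documents.
toy_counts = {
    "goal": (8, 2, 2, 88),     # mostly in-category: positive feature
    "match": (6, 10, 4, 80),   # weaker positive feature
    "stock": (0, 45, 10, 45),  # never in-category: negative feature
    "the": (10, 90, 0, 0),     # in every document: uninformative
}

print(select_two_sided(toy_counts, 2))
print(select_explicit(toy_counts, 2, 0.5))
```

On these toy counts, the two-sided selection fills both slots with positive terms ("goal" and "match"), whereas the explicit split reserves one slot for the negative term "stock". This is the kind of imbalance effect the abstract describes, though the paper's actual procedure for choosing the ratio is not given on this page.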

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/1007730.1007741

References: 16 articles.

1. Inductive learning algorithms and representations for text categorization

2. Wrappers for feature subset selection

Cited by 334 articles.

1. A class-imbalanced hybrid learning strategy based on Raman spectroscopy of serum samples for the diagnosis of hepatitis B, hepatitis A, and thyroid dysfunction;Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy;2024-11

2. Sparse feature selection and rare value prediction in imbalanced regression;Information Sciences;2024-10

3. Applications of Autonomous Learning Multi Model System to Multiclass Imbalanced Datasets;2024 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE);2024-06-30

4. An optimal feature selection method for text classification through redundancy and synergy analysis;Multimedia Tools and Applications;2024-06-28

5. Data oversampling and imbalanced datasets: an investigation of performance for machine learning and feature engineering;Journal of Big Data;2024-06-17