Term evaluation metrics in imbalanced text categorization

Published: 2019-07-12 Issue: 1 Volume: 26 Page: 31-47
ISSN: 1351-3249
Container-title: Natural Language Engineering
Language: en
Short-container-title: Nat. Lang. Eng.

Author:

Naderalvojoud Behzad, Akcapinar Sezer Ebru

Abstract

This paper proposes four novel term evaluation metrics for representing documents in text categorization where the class distribution is imbalanced. These metrics are derived by revising four common term evaluation metrics: chi-square, information gain, odds ratio, and relevance frequency. While the common metrics require a balanced class distribution, the proposed metrics evaluate document terms under an imbalanced distribution. They calculate the degree of relatedness of terms with respect to the minor and major classes by taking their imbalanced distribution into account. Using these metrics in the document representation better distinguishes documents of the minor and major classes and improves the performance of machine learning algorithms. The proposed metrics are assessed on three popular benchmarks (two subsets of Reuters-21578 and WebKB) using four classification algorithms: support vector machines, naive Bayes, decision trees, and centroid-based classifiers. The empirical results indicate that the proposed metrics outperform the common metrics in imbalanced text categorization.
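
For orientation, the four common metrics named in the abstract can be sketched from a 2x2 term/class contingency table. This is a minimal illustration using the usual textbook definitions, not the revised metrics the paper proposes, which this record does not spell out. The function names, the argument names (tp, fp, fn, tn) and the handling of zero cells are assumptions made for the example.

```python
import math

# Contingency counts for one term and one (positive) class. Names are illustrative:
#   tp: positive-class documents containing the term
#   fp: negative-class documents containing the term
#   fn: positive-class documents without the term
#   tn: negative-class documents without the term


def chi_square(tp, fp, fn, tn):
    """Chi-square statistic of the 2x2 term/class table."""
    n = tp + fp + fn + tn
    denom = (tp + fp) * (fn + tn) * (tp + fn) * (fp + tn)
    # Assumption: return 0.0 when a row or column of the table is empty.
    return n * (tp * tn - fp * fn) ** 2 / denom if denom else 0.0


def information_gain(tp, fp, fn, tn):
    """Mutual information (in bits) between term presence and class label."""
    n = tp + fp + fn + tn
    total = 0.0
    for joint, term_count, class_count in (
        (tp, tp + fp, tp + fn),  # term present, positive class
        (fp, tp + fp, fp + tn),  # term present, negative class
        (fn, fn + tn, tp + fn),  # term absent, positive class
        (tn, fn + tn, fp + tn),  # term absent, negative class
    ):
        if joint:  # empty cells contribute nothing to the sum
            total += (joint / n) * math.log2(joint * n / (term_count * class_count))
    return total


def odds_ratio(tp, fp, fn, tn):
    """Odds of the term in the positive class over its odds in the negative class."""
    denom = fp * fn
    # Assumption: report infinity instead of raising on a zero denominator.
    return (tp * tn) / denom if denom else float("inf")


def relevance_frequency(tp, fp):
    """Relevance frequency in its common form: log2(2 + tp / max(1, fp))."""
    return math.log2(2 + tp / max(1, fp))


if __name__ == "__main__":
    counts = (30, 20, 10, 40)  # made-up counts, for illustration only
    print(chi_square(*counts))
    print(information_gain(*counts))
    print(odds_ratio(*counts))
    print(relevance_frequency(counts[0], counts[1]))
```

None of these common forms adjusts for the relative sizes of the two classes; that adjustment is what the abstract says the proposed metrics add.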

Publisher

Cambridge University Press (CUP)

Subject

Artificial Intelligence,Linguistics and Language,Language and Linguistics,Software

References: 49 articles.

1. Feature selection for high-dimensional imbalanced data

2. Understanding inverse document frequency: on theoretical arguments for IDF

3. Imbalanced text classification: A term weighting approach

Cited by 7 articles.

1. A new oversampling approach based differential evolution on the safe set for highly imbalanced datasets;Expert Systems with Applications;2023-12

2. An oversampling method based on differential evolution and natural neighbors;Applied Soft Computing;2023-12

3. Sequential Three-Way Rules Class-Overlap Under-Sampling Based on Fuzzy Hierarchical Subspace for Imbalanced Data;Communications in Computer and Information Science;2023

4. INVESTIGATING TERM WEIGHTING SCHEMES ON THE CLASSIFICATION PERFORMANCE FOR THE IMBALANCED TEXT DATA;Advances and Applications in Statistics;2022-06-22

5. Designing an efficient unigram keyword detector for documents using Relative Entropy;Multimedia Tools and Applications;2022-04-22