Towards Improved Classification Accuracy on Highly Imbalanced Text Dataset Using Deep Neural Language Models

Published:2021-01-19 Issue:2 Volume:11 Page:869
ISSN:2076-3417
Container-title:Applied Sciences
language:en
Short-container-title:Applied Sciences

Author:

Shaikh Sarang, Daudpota Sher Muhammad, Imran Ali Shariq, Kastrati Zenun

Abstract

Data imbalance is a frequently occurring problem in classification tasks, where the number of samples in one category exceeds the number in the others. Quite often, the minority class data is of great importance, representing concepts of interest, and is challenging to obtain in real-life scenarios and applications. Imagine a customer dataset for bank loans: the majority of the instances belong to the non-defaulter class and only a small number of customers are labeled as defaulters, yet in such highly imbalanced datasets classification accuracy matters more on the defaulter label than on the non-defaulter label. A lack of sufficient data samples across all class labels results in data imbalance, causing poor classification performance when training the model. Synthetic data generation and oversampling techniques such as SMOTE and ADASYN can address this issue for statistical data, yet such methods suffer from overfitting and substantial noise. While such techniques have proved useful for synthetic numerical and image data generation using GANs, the effectiveness of approaches proposed for textual data, which can retain grammatical structure, context, and semantic information, has yet to be evaluated. In this paper, we address this issue by assessing text sequence generation algorithms coupled with grammatical validation on domain-specific, highly imbalanced datasets for text classification. We exploit the recently proposed GPT-2 and LSTM-based text generation models to introduce balance in highly imbalanced text datasets. The experiments presented in this paper on three highly imbalanced datasets from different domains show that the performance of the same deep neural network models improves by up to 17% when the datasets are balanced using generated text.
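The balancing idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `balance_with_generated_text` and its parameters are hypothetical, `generate` stands in for any text generator (for example a fine-tuned GPT-2 or LSTM model), and `is_valid` stands in for the grammatical validation step, whose actual form the abstract does not specify.

```python
from collections import Counter
from typing import Callable, List, Tuple


def balance_with_generated_text(
    samples: List[Tuple[str, str]],
    generate: Callable[[str], str],
    is_valid: Callable[[str], bool] = lambda text: bool(text.strip()),
    max_attempts_per_sample: int = 10,
) -> List[Tuple[str, str]]:
    """Top up each minority class to the majority-class size with generated text.

    samples  -- (text, label) pairs of the imbalanced dataset
    generate -- hypothetical generator: takes a label, returns one synthetic text
    is_valid -- hypothetical validation hook (assumed here: non-empty text)
    """
    counts = Counter(label for _, label in samples)
    target = max(counts.values())  # size of the majority class
    balanced = list(samples)

    for label, count in counts.items():
        needed = target - count
        # Bound the number of generation attempts so a generator that keeps
        # failing validation cannot loop forever.
        budget = max_attempts_per_sample * needed
        while needed > 0 and budget > 0:
            budget -= 1
            candidate = generate(label)
            if is_valid(candidate):
                balanced.append((candidate, label))
                needed -= 1
    return balanced


# Toy usage with the bank-loan example: 5 non-defaulters, 1 defaulter.
data = [("repaid on time", "non_defaulter")] * 5 + [("missed payments", "defaulter")]
balanced = balance_with_generated_text(data, lambda label: "synthetic " + label + " text")
label_counts = Counter(label for _, label in balanced)
```

Only the minority classes receive synthetic samples; the original samples are kept unchanged, so the same classifier can be trained on both the original and the balanced dataset for comparison.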

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes,Computer Science Applications,Process Chemistry and Technology,General Engineering,Instrumentation,General Materials Science

Link

https://www.mdpi.com/2076-3417/11/2/869/pdf

References: 32 articles.

1. SMOTE: Synthetic Minority Over-sampling Technique

2. Augmented cyclegan: Learning many-to-many mappings from unpaired data;Almahairi;arXiv,2018

3. Rouge: A package for automatic evaluation of summaries;Lin,2004

Cited by 38 articles.

1. Generating synthetic data with variational autoencoder to address class imbalance of graph attention network prediction model for construction management;Advanced Engineering Informatics;2024-10

2. Oversampling effect in pretraining for bidirectional encoder representations from transformers (BERT) to localize medical BERT and enhance biomedical BERT;Artificial Intelligence in Medicine;2024-07

3. SPRAG: building and benchmarking a Short Programming-Related Answer Grading dataset;International Journal of Data Science and Analytics;2024-06-04

4. Extracting Urgent Questions from MOOC Discussions: A BERT-Based Multi-output Classification Approach;Arabian Journal for Science and Engineering;2024-05-31

5. Artificial Intelligence for Analyzing Psychiatric Disorders in Social Media: A Quarter-Century Narrative Review of Progress and Challenges (Preprint);2024-04-23