Statistical Analysis of Imbalanced Classification with Training Size Variation and Subsampling on Datasets of Research Papers in Biomedical Literature

Published:2023-12-11 Issue:4 Volume:5 Page:1953-1978
ISSN:2504-4990
Container-title:Machine Learning and Knowledge Extraction
language:en
Short-container-title:MAKE

Author:

Dixon Jose¹, Rahman Md¹

Affiliation:

1. Computer Science Department, Morgan State University, Baltimore, MD 21251, USA

Abstract

This paper demonstrates how data preprocessing, training size variation, and subsampling can change the performance metrics of imbalanced text classification. The methodology compares two supervised learning approaches to feature engineering and data preprocessing, using five machine learning (ML) classifiers, five imbalanced sampling techniques, specified intervals of training and subsampling sizes, and statistical analysis in R with the tidyverse. The dataset consists of 1000 Portable Document Format (PDF) files divided into five labels, drawn from the World Health Organization Coronavirus Research Downloadable Articles database (COVID-19 papers) and the PubMed Central database (non-COVID-19 papers) for binary classification. The performance metrics examined are precision, recall, area under the receiver operating characteristic curve, and accuracy. One approach, which labels rows of sentences using regular expressions, significantly improved the performance of imbalanced sampling techniques compared with the other approach, which labels sentences automatically according to how the documents are organized into positive and negative classes; this was verified with a t-test on the performance metrics recorded across iterations. The study demonstrates the effectiveness of ML classifiers and sampling techniques on text classification datasets, with different performance levels and class imbalance issues observed between the manual and automatic methods of data processing.
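The three ingredients named in the abstract (regular-expression labeling of sentences, resampling of an imbalanced set, and a t-test on per-iteration metrics) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The pattern `COVID_PATTERN`, the helper names, the toy sentences, the choice of random undersampling as the sampling technique, and the choice of Welch's variant of the t-test are all assumptions made for illustration; the abstract specifies none of them.

```python
import math
import random
import re
import statistics

# Hypothetical pattern: the abstract does not list the regular expressions used.
COVID_PATTERN = re.compile(r"\b(covid-19|sars-cov-2|coronavirus)\b", re.IGNORECASE)


def label_sentence(sentence):
    """Return 1 if the sentence matches the pattern, else 0."""
    return 1 if COVID_PATTERN.search(sentence) else 0


def random_undersample(rows, seed=0):
    """Drop majority-class rows at random until both classes are the same size."""
    rng = random.Random(seed)
    pos = [r for r in rows if r[1] == 1]
    neg = [r for r in rows if r[1] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return minority + rng.sample(majority, len(minority))


def welch_t(a, b):
    """Welch's t statistic for two independent samples of metric values."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


# Toy sentences invented for this sketch; they are not taken from the dataset.
sentences = [
    "COVID-19 patients were enrolled in the trial.",
    "The tumor samples were sequenced.",
    "Insulin dosage was adjusted weekly.",
    "SARS-CoV-2 spike protein binds ACE2.",
    "Mice were housed in standard cages.",
    "Blood pressure was recorded daily.",
]
rows = [(s, label_sentence(s)) for s in sentences]  # (sentence, label) pairs
balanced = random_undersample(rows)                 # equal class counts
```

In a comparison like the one the abstract describes, `welch_t` would take two lists of a metric (for example, recall) recorded over repeated iterations, one list per labeling approach; a p-value would additionally require the t distribution, which this sketch leaves out.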

Funder

National Science Foundation

Publisher

MDPI AG

Subject

Artificial Intelligence,Engineering (miscellaneous)

Link

https://www.mdpi.com/2504-4990/5/4/95/pdf
