Statistical Analysis of Imbalanced Classification with Training Size Variation and Subsampling on Datasets of Research Papers in Biomedical Literature

Author:

Dixon Jose1ORCID,Rahman Md1ORCID

Affiliation:

1. Computer Science Department, Morgan State University, Baltimore, MD 21251, USA

Abstract

The overall purpose of this paper is to demonstrate how data preprocessing, training size variation, and subsampling can dynamically change the performance metrics of imbalanced text classification. The methodology encompasses using two different supervised learning classification approaches of feature engineering and data preprocessing with the use of five machine learning classifiers, five imbalanced sampling techniques, specified intervals of training and subsampling sizes, statistical analysis using R and tidyverse on a dataset of 1000 portable document format files divided into five labels from the World Health Organization Coronavirus Research Downloadable Articles of COVID-19 papers and PubMed Central databases of non-COVID-19 papers for binary classification that affects the performance metrics of precision, recall, receiver operating characteristic area under the curve, and accuracy. One approach that involves labeling rows of sentences based on regular expressions significantly improved the performance of imbalanced sampling techniques verified by performing statistical analysis using a t-test documenting performance metrics of iterations versus another approach that automatically labels the sentences based on how the documents are organized into positive and negative classes. The study demonstrates the effectiveness of ML classifiers and sampling techniques in text classification datasets, with different performance levels and class imbalance issues observed in manual and automatic methods of data processing.

Funder

National Science Foundation

Publisher

MDPI AG

Subject

Artificial Intelligence,Engineering (miscellaneous)

Reference42 articles.

1. Büttcher, S., Clarke, C., and Cormack, G.V. (2010). Information Retrieval: Implementing and Evaluating Search Engines, The MIT Press.

2. Information Filtering and Information Retrieval: Two Sides of the Same Coin?;Belkin;Commun. ACM,1992

3. Kowsari, K., Meimandi, K.J., Heidarysafa, M., Mendu, S., Barnes, L.E., and Brown, D.E. (2019). Text Classification Algorithms: A Survey. Information, 10.

4. Training cost-sensitive neural networks with methods addressing the class imbalance problem;Zhou;IEEE Trans. Knowl. Data Eng.,2006

5. Zhang, Z., Jasaitis, T., Freeman, R., Alfrjani, R., and Funk, A. (2023). Mining Healthcare Procurement Data Using Text Mining and Natural Language Processing—Reflection from an Industrial Project. arXiv.

同舟云学术

1.学者识别学者识别

2.学术分析学术分析

3.人才评估人才评估

"同舟云学术"是以全球学者为主线,采集、加工和组织学术论文而形成的新型学术文献查询和分析系统,可以对全球学者进行文献检索和人才价值评估。用户可以通过关注某些学科领域的顶尖人物而持续追踪该领域的学科进展和研究前沿。经过近期的数据扩容,当前同舟云学术共收录了国内外主流学术期刊6万余种,收集的期刊论文及会议论文总量共计约1.5亿篇,并以每天添加12000余篇中外论文的速度递增。我们也可以为用户提供个性化、定制化的学者数据。欢迎来电咨询!咨询电话:010-8811{复制后删除}0370

www.globalauthorid.com

TOP

Copyright © 2019-2024 北京同舟云网络信息技术有限公司
京公网安备11010802033243号  京ICP备18003416号-3