Affiliation:
1. Morgan State University
Abstract
This work presents a statistical and exploratory analysis of how machine learning (ML) classifiers and sampling techniques affect classification performance on document datasets. One thousand portable document format (PDF) files, drawn from the World Health Organization COVID-19 Research Downloadable Articles and the PubMed Central databases, are divided into five labels of positive and negative papers. The PDF files are converted to unstructured raw text and pre-processed before tokenization. Training size and subsampling are varied experimentally to determine their effect on performance measures such as accuracy, precision, recall, and AUC. Supervised classification is performed with Random Forest, Naïve Bayes, Decision Tree, XGBoost, and Logistic Regression. To address the imbalanced distribution of positive and negative samples, the Synthetic Minority Oversampling Technique (SMOTE), Random Oversampling (ROS), Random Undersampling (RUS), TomekLinks, and NearMiss are applied. R and the tidyverse are used for statistical and exploratory data analysis of the resulting performance metrics. The ML classifiers achieve an average precision of 78% and an average recall of 77%, while the sampling techniques yield higher average precision and recall of 80% and 81%, respectively. Correcting the class imbalance produces statistically significant p-values for precision and recall with NearMiss, ROS, and SMOTE. Analysis of variance (ANOVA) confirms, with statistical significance, that varying training size, subsampling, and applying imbalanced sampling techniques alongside ML algorithms can improve performance on document datasets.
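As an illustration of the classifier-plus-resampling pipeline summarized above, the following is a minimal sketch in Python using scikit-learn and imbalanced-learn, assuming SMOTE oversampling paired with a Random Forest classifier. The library choice and the synthetic dataset are assumptions for illustration only; the abstract names the corpus (WHO/PubMed PDFs converted to text) and R/tidyverse for the statistical analysis, but does not specify the ML implementation.

```python
# Hypothetical sketch: imbalanced binary classification with SMOTE oversampling
# and a Random Forest, scored by precision, recall, and AUC.
# The synthetic data below is a placeholder for the tokenized WHO/PubMed corpus.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced placeholder data: 1,000 samples, roughly 10% positive class.
X, y = make_classification(
    n_samples=1000, n_features=50, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Rebalance only the training split; the test set keeps the original distribution.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_res, y_res)
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]

print("precision:", precision_score(y_test, y_pred))
print("recall:   ", recall_score(y_test, y_pred))
print("AUC:      ", roc_auc_score(y_test, y_prob))
```

Resampling is applied only to the training split so that the reported precision, recall, and AUC reflect the original class distribution, mirroring the evaluation setup implied in the abstract.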
Publisher
Research Square Platform LLC