Unsupervised machine learning for disease prediction: a comparative performance analysis using multiple datasets

Published:2023-12-29 Issue:1 Volume:14 Page:141-154
ISSN:2190-7188
Container-title:Health and Technology
language:en
Short-container-title:Health Technol.

Author:

Lu Haohui, Uddin Shahadat

Abstract

Purpose: Disease risk prediction poses a significant and growing challenge in the medical field. Researchers have increasingly used machine learning (ML) algorithms to tackle it, but supervised ML methods remain dominant. Interest in unsupervised techniques is nonetheless rising, especially where data labels may be missing, as with undiagnosed or rare diseases. This study compares unsupervised ML models for disease prediction.

Methods: This study evaluated seven unsupervised algorithms on 15 datasets, including datasets on heart failure, diabetes, and breast cancer. The comparison used six performance metrics: Adjusted Rand Index, Adjusted Mutual Information, Homogeneity, Completeness, V-measure and Silhouette Coefficient.

Results: Among the seven unsupervised ML methods, DBSCAN (Density-based spatial clustering of applications with noise) achieved the best performance most often (31 times), followed by the Bayesian Gaussian Mixture (18) and Divisive clustering (15). No single model consistently outperformed the others across every dataset and metric. The study emphasises that the model and the performance measure should be selected according to application-specific needs. For example, DBSCAN excels on the Homogeneity, Completeness and V-measure metrics, whereas the Bayesian Gaussian Mixture performs well on the Adjusted Rand Index. The code used in this study can be found at https://github.com/haohuilu/unsupervisedml/.

Conclusion: This research contributes deeper insights into unsupervised ML applications in healthcare and encourages further investigation into model selection. Subsequent studies could use genuine disease records for a more nuanced comparison and evaluation of models.
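Three of the six metrics named in the abstract (Homogeneity, Completeness and V-measure) are entropy-based and closely related, so a small worked sketch may help show what they measure. The code below is illustrative only and is not the authors' implementation (their code is at the GitHub link in the abstract). The function names are hypothetical, natural logarithms are assumed, and V-measure is taken as the harmonic mean of the other two scores.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(targets, given):
    """H(targets | given) for two equal-length label sequences."""
    n = len(targets)
    joint = Counter(zip(given, targets))
    given_counts = Counter(given)
    return -sum(
        (c / n) * log(c / given_counts[g]) for (g, _), c in joint.items()
    )


def homogeneity_completeness_v(true_labels, cluster_labels):
    """Return (homogeneity, completeness, V-measure), each in [0, 1]."""
    h_true = entropy(true_labels)
    h_clust = entropy(cluster_labels)
    # By convention, a labelling with zero entropy scores 1.0.
    homogeneity = (
        1.0 if h_true == 0
        else 1.0 - conditional_entropy(true_labels, cluster_labels) / h_true
    )
    completeness = (
        1.0 if h_clust == 0
        else 1.0 - conditional_entropy(cluster_labels, true_labels) / h_clust
    )
    if homogeneity + completeness == 0:
        return 0.0, 0.0, 0.0
    # Harmonic mean of homogeneity and completeness.
    v_measure = 2 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v_measure
```

For instance, placing every sample in a single cluster, as in `homogeneity_completeness_v([0, 0, 1, 1], [0, 0, 0, 0])`, gives perfect completeness but zero homogeneity, and therefore a V-measure of zero. This illustrates why the abstract stresses choosing the performance measure to suit the application.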

Funder

University of Sydney

Publisher

Springer Science and Business Media LLC

Subject

Biomedical Engineering, Applied Microbiology and Biotechnology, Bioengineering, Biotechnology

Link

https://link.springer.com/content/pdf/10.1007/s12553-023-00805-8.pdf

References: 55 articles.

1. Alloghani M, Al-Jumeily D, Mustafina J, Hussain A, Aljaaf AJ. A systematic review on supervised and unsupervised machine learning algorithms for data science. In: Supervised and unsupervised learning for data science. Springer; 2020. p. 3–21.

2. Chen H, Wu L, Chen J, Lu W, Ding J. A comparative study of automated legal text classification using random forests and deep learning. Inf Process Manage. 2022;59(2):102798.

3. Uddin S, Ong S, Lu H. Machine learning in project analytics: a data-driven framework and case study. Sci Rep. 2022;12(1):15252.

4. Jáñez-Martino F, Alaiz-Rodríguez R, González-Castro V, Fidalgo E, Alegre E. A review of spam email detection: analysis of spammer strategies and the dataset shift problem. Artif Intell Rev. 2023;56(2):1145–73.

5. Miklosik A, Evans N. Impact of big data and machine learning on digital transformation in marketing: A literature review. IEEE Access. 2020;8:101284–92.