Benchmarking AutoML frameworks for disease prediction using medical claims

Published:2022-07-26 Issue:1 Volume:15 Page:
ISSN:1756-0381
Container-title:BioData Mining
language:en
Short-container-title:BioData Mining

Author:

Romero Roland Albert A., Deypalan Mariefel Nicole Y., Mehrotra Suchit, Jungao John Titus, Sheils Natalie E., Manduchi Elisabetta, Moore Jason H.

Abstract

Objectives: Ascertain and compare the performances of Automated Machine Learning (AutoML) tools on large, highly imbalanced healthcare datasets.

Materials and Methods: We generated a large dataset using historical de-identified administrative claims including demographic information and flags for disease codes in four different time windows prior to 2019. We then trained three AutoML tools on this dataset to predict six different disease outcomes in 2019 and evaluated model performances on several metrics.

Results: The AutoML tools showed improvement from the baseline random forest model but did not differ significantly from each other. All models recorded low area under the precision-recall curve and failed to predict true positives while keeping the true negative rate high. Model performance was not directly related to prevalence. We provide a specific use-case to illustrate how to select a threshold that gives the best balance between true and false positive rates, as this is an important consideration in medical applications.

Discussion: Healthcare datasets present several challenges for AutoML tools, including large sample size, high imbalance, and limitations in the available features. Improvements in scalability, combinations of imbalance-learning resampling and ensemble approaches, and curated feature selection are possible next steps to achieve better performance.

Conclusion: Among the three explored, no AutoML tool consistently outperforms the rest in terms of predictive performance. The performances of the models in this study suggest that there may be room for improvement in handling medical claims data. Finally, selection of the optimal prediction threshold should be guided by the specific practical application.
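The abstract mentions selecting a threshold that balances true and false positive rates. The sketch below illustrates one common way to do this: scan candidate thresholds and keep the one that maximizes Youden's J statistic (TPR minus FPR). This criterion is an assumption chosen for illustration; the abstract does not state which rule the authors used. The function names and the toy labels and scores are hypothetical and are not data from the study.

```python
from typing import Sequence, Tuple


def rates_at_threshold(
    y_true: Sequence[int], scores: Sequence[float], threshold: float
) -> Tuple[float, float]:
    """Return (TPR, FPR) when a case is predicted positive if score >= threshold."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if label == 1 and predicted_positive:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


def best_threshold_youden(
    y_true: Sequence[int], scores: Sequence[float]
) -> Tuple[float, float, float]:
    """Try every distinct score as a threshold; return (threshold, TPR, FPR)
    for the threshold with the largest TPR - FPR (Youden's J)."""
    best = None
    for candidate in sorted(set(scores)):
        tpr, fpr = rates_at_threshold(y_true, scores, candidate)
        j = tpr - fpr
        if best is None or j > best[0]:
            best = (j, candidate, tpr, fpr)
    if best is None:
        raise ValueError("scores must not be empty")
    _, threshold, tpr, fpr = best
    return threshold, tpr, fpr


# Hypothetical, imbalanced toy example: 8 negatives and 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.35, 0.60, 0.40, 0.70]

threshold, tpr, fpr = best_threshold_youden(y_true, scores)
print(f"threshold={threshold:.2f} TPR={tpr:.3f} FPR={fpr:.3f}")
```

In a real application the criterion would likely be weighted by the relative cost of a missed case versus a false alarm, which is consistent with the abstract's point that the threshold should be guided by the specific practical use.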

Funder

National Institutes of Health

Publisher

Springer Science and Business Media LLC

Subject

Computational Mathematics,Computational Theory and Mathematics,Computer Science Applications,Genetics,Molecular Biology,Biochemistry

Link

https://link.springer.com/content/pdf/10.1186/s13040-022-00300-2.pdf
