Repeated Sieving for Prediction Model Building with High-Dimensional Data

Published:2024-07-19 Issue:7 Volume:14 Page:769
ISSN:2075-4426
Container-title:Journal of Personalized Medicine
language:en
Short-container-title:JPM

Author:

Liu Lu¹, Jung Sin-Ho¹

Affiliation:

1. Department of Biostatistics and Bioinformatics, Duke University, Durham, NC 27708, USA

Abstract

Background: The prediction of patients’ outcomes is a key component in personalized medicine. Often, a prediction model is developed using a large number of candidate predictors, called high-dimensional data, including genomic data, lab tests, electronic health records, etc. Variable selection, also called dimension reduction, is a critical step in developing a prediction model using high-dimensional data.
Methods: In this paper, we compare the variable selection and prediction performance of popular machine learning (ML) methods with our proposed method. LASSO is a popular ML method that selects variables by imposing an L1-norm penalty on the likelihood. With this approach, LASSO selects features based on the size of regression estimates, rather than their statistical significance. As a result, LASSO can miss significant features even though it is known to over-select features. Elastic net (EN), another popular ML method, tends to select even more features than LASSO since it uses a combination of L1- and L2-norm penalties that is less strict than an L1-norm penalty. Insignificant features included in a fitted prediction model act like white noise, so the fitted model loses prediction accuracy. Furthermore, for the future use of a fitted prediction model, we have to collect the data of all the features included in the model, which is costly and may lower the accuracy of the data if the number of features is too large. Therefore, we propose an ML method, called repeated sieving, that extends the standard regression methods with stepwise variable selection. By selecting features based on their statistical significance, it resolves the over-selection issue with high-dimensional data.
Results: Through extensive numerical studies and real data examples, our results show that the repeated sieving method selects far fewer features than LASSO and EN, but has higher prediction accuracy than the existing ML methods.
Conclusions: We conclude that our repeated sieving method performs well in both variable selection and prediction, and it saves the cost of future investigation of the selected factors.
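
The abstract's point that LASSO and EN select features by the size of their estimates, and that EN's mixed penalty is less strict, can be illustrated with a small sketch. The code below is not the authors' implementation and does not implement repeated sieving. It assumes a linear model with squared-error loss, centered data with no intercept, and the common parameterization (1/(2n))·||y − Xb||² + lam·(alpha·|b|₁ + (1 − alpha)/2·||b||²); the function names, the toy data, and the values of lam and alpha are hypothetical choices made only for illustration.

```python
def soft_threshold(rho, t):
    """Shrink rho toward zero by t; return exactly 0.0 when |rho| <= t."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0


def elastic_net_cd(X, y, lam, alpha=1.0, n_iter=200, tol=1e-10):
    """Coordinate descent for the penalized least-squares objective above.

    alpha = 1.0 gives LASSO (pure L1 penalty);
    0 < alpha < 1 gives elastic net (L1 and L2 mixed).
    Assumes X and y are centered (no intercept term is fitted).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residuals y - X @ beta, kept up to date
    z = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            # Covariance of feature j with the partial residual (feature j excluded).
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            denom = z[j] + lam * (1.0 - alpha)
            new = soft_threshold(rho, lam * alpha) / denom if denom > 0 else 0.0
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
            max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta


def selected(beta):
    """Indices of features with a nonzero coefficient."""
    return [j for j, b in enumerate(beta) if b != 0.0]


# Toy orthogonal design: feature 0 has a large effect, feature 1 a small one.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [4.0, 1.5]
lam = 1.0

lasso_beta = elastic_net_cd(X, y, lam, alpha=1.0)
enet_beta = elastic_net_cd(X, y, lam, alpha=0.5)
print(lasso_beta, selected(lasso_beta))  # [2.0, 0.0] [0]
print(enet_beta, selected(enet_beta))    # [1.5, 0.25] [0, 1]
```

In this toy case a feature enters the model only when its covariance with the residual exceeds the L1 threshold lam·alpha. At the same lam, the elastic net threshold is lower, so the weak second feature is retained. Neither rule consults a standard error or p-value, which is the contrast with significance-based selection that the abstract draws.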

Publisher

MDPI AG

Link

https://www.mdpi.com/2075-4426/14/7/769/pdf

References (23 articles)

1. Incremental Benefits of Machine Learning—When Do We Need a Better Mousetrap;Engelhard;JAMA Cardiol.,2021

2. Regression Shrinkage and Selection via the Lasso;Tibshirani;J. R. Stat. Soc. Ser. B (Methodol.),1996

3. Liu, L., Gao, J., Beasley, G., and Jung, S.H. (2023). LASSO and Elastic Net Tend to Over-Select Features. Mathematics, 11.

4. Lee, J., Sohn, I., Do, I.G., Kim, K.M., Park, S.H., Park, J.O., Park, Y.S., Lim, H.Y., Sohn, T.S., and Bae, J.M. (2014). Nanostring-based multigene assay to predict recurrence for gastric cancer patients after surgery. PLoS ONE, 9.

5. Regularization and Variable Selection via the Elastic Net;Zou;J. R. Stat. Soc. Ser. B Statistical Methodol.,2005