Optimal ensemble construction for multistudy prediction with applications to mortality estimation

Author:

Loewinger Gabriel (1), Nunez Rolando Acosta (2, 3), Mazumder Rahul (4), Parmigiani Giovanni (2, 5)

Affiliation:

1. Machine Learning Team National Institute of Mental Health Bethesda Maryland USA

2. Department of Biostatistics Harvard School of Public Health Boston Massachusetts USA

3. Regeneron Pharmaceuticals Inc. Tarrytown New York USA

4. Operations Research Center and MIT Center for Statistics MIT Sloan School of Management Cambridge Massachusetts USA

5. Department of Data Science Dana-Farber Cancer Institute Boston Massachusetts USA

Abstract

It is increasingly common to encounter prediction tasks in the biomedical sciences for which multiple datasets are available for model training. Common approaches such as pooling datasets before model fitting can produce poor out‐of‐study prediction performance when datasets are heterogeneous. Theoretical and applied work has shown multistudy ensembling to be a viable alternative that leverages the variability across datasets in a manner that promotes model generalizability. Multistudy ensembling uses a two‐stage stacking strategy which fits study‐specific models and estimates ensemble weights separately. This approach ignores, however, the ensemble properties at the model‐fitting stage, potentially resulting in performance losses. Motivated by challenges in the estimation of COVID‐attributable mortality, we propose optimal ensemble construction, an approach to multistudy stacking whereby we jointly estimate ensemble weights and parameters associated with study‐specific models. We prove that limiting cases of our approach yield existing methods such as multistudy stacking and pooling datasets before model fitting. We propose an efficient block coordinate descent algorithm to optimize the loss function. We use our method to perform multicountry COVID‐19 baseline mortality prediction. We show that when little data is available for a country before the onset of the pandemic, leveraging data from other countries can substantially improve prediction accuracy. We further compare and characterize the method's performance in data‐driven simulations and other numerical experiments. Our method remains competitive with or outperforms multistudy stacking and other earlier methods in the COVID‐19 data application and in a range of simulation settings.
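To make the idea of joint estimation concrete, the following is a minimal sketch of block coordinate descent on an illustrative loss. The loss is an assumption made for this example and is not taken from the paper: it adds the squared error of the weighted ensemble on all studies to `mu` times the sum of study-specific squared errors. The function names (`joint_loss`, `fit_joint`), the tuning parameter `mu`, the use of linear models, and the simulated data are all hypothetical. Unlike typical stacking, this sketch places no nonnegativity constraint on the weights. Under this assumed loss, a very large `mu` pins each coefficient vector to its own study's least-squares fit, which resembles two-stage stacking.

```python
import numpy as np

def joint_loss(X, y, Xs, ys, B, w, mu):
    """Assumed loss: ensemble error on all data + mu * study-specific errors."""
    ensemble = np.sum((y - X @ (B @ w)) ** 2)
    study = sum(np.sum((yk - Xk @ B[:, k]) ** 2)
                for k, (Xk, yk) in enumerate(zip(Xs, ys)))
    return ensemble + mu * study

def fit_joint(Xs, ys, mu=1.0, n_iter=20):
    """Alternate exact updates of the weights w and each coefficient column of B."""
    X, y = np.vstack(Xs), np.concatenate(ys)
    K = len(Xs)
    # Initialize each column of B with that study's own least-squares fit.
    B = np.column_stack([np.linalg.lstsq(Xk, yk, rcond=None)[0]
                         for Xk, yk in zip(Xs, ys)])
    history = []
    for _ in range(n_iter):
        # Weight block: least squares of y on the K study-model predictions.
        w = np.linalg.lstsq(X @ B, y, rcond=None)[0]
        # Coefficient blocks: closed-form update of one study's model at a time.
        for k in range(K):
            # Residual of the ensemble with study k's contribution removed.
            r = y - X @ (B @ w) + w[k] * (X @ B[:, k])
            A = w[k] ** 2 * (X.T @ X) + mu * (Xs[k].T @ Xs[k])
            b = w[k] * (X.T @ r) + mu * (Xs[k].T @ ys[k])
            B[:, k] = np.linalg.solve(A, b)
        history.append(joint_loss(X, y, Xs, ys, B, w, mu))
    return B, w, history

# Simulated heterogeneous studies (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
p, n, K = 4, 50, 3
beta_shared = rng.normal(size=p)
Xs = [rng.normal(size=(n, p)) for _ in range(K)]
ys = [Xk @ (beta_shared + 0.5 * rng.normal(size=p)) + rng.normal(size=n)
      for Xk in Xs]

B, w, history = fit_joint(Xs, ys, mu=1.0)
```

Each block update minimizes the assumed loss exactly with the other blocks held fixed, so the values recorded in `history` should not increase from one sweep to the next.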

Funder

National Institute on Drug Abuse

National Science Foundation

National Institutes of Health

Publisher

Wiley
