Valid inference for machine learning-assisted GWAS

Published: 2024-01-04

Authors:

Miao Jiacheng, Wu Yixuan, Sun Zhongxuan, Miao Xinran, Lu Tianyuan, Zhao Jiwei, Lu Qiongshi

Abstract

Machine learning (ML) has revolutionized analytical strategies in almost all scientific disciplines, including human genetics and genomics. Due to challenges in sample collection and precise phenotyping, ML-assisted genome-wide association study (GWAS), which uses sophisticated ML to impute phenotypes and then performs GWAS on the imputed outcomes, has quickly gained popularity in complex trait genetics research. However, the validity of associations identified from ML-assisted GWAS has not been carefully evaluated. In this study, we report pervasive risks for false positive associations in ML-assisted GWAS, and introduce POP-GWAS, a novel statistical framework that reimagines GWAS on ML-imputed outcomes. POP-GWAS provides valid statistical inference irrespective of the quality of imputation or the variables and algorithms used for imputation. It also requires only GWAS summary statistics as input. We employed POP-GWAS to perform the largest GWAS of bone mineral density (BMD) derived from dual-energy X-ray absorptiometry imaging at 14 skeletal sites, identifying 89 novel loci reaching genome-wide significance and revealing skeletal site-specific genetic architecture of BMD. Our framework may fundamentally reshape the analytical strategies in future ML-assisted GWAS.

Publisher

Cold Spring Harbor Laboratory
