Machine Learning Strategies for Improved Phenotype Prediction in Underrepresented Populations

Published: 2023-10-17

Author:

Bonet David, Levin May, Montserrat Daniel Mas, Ioannidis Alexander G.

Abstract

Precision medicine models often perform better for populations of European ancestry due to the over-representation of this group in the genomic datasets and large-scale biobanks from which the models are constructed. As a result, prediction models may misrepresent or provide less accurate treatment recommendations for underrepresented populations, contributing to health disparities. This study introduces an adaptable machine learning toolkit that integrates multiple existing methodologies and novel techniques to enhance the prediction accuracy for underrepresented populations in genomic datasets. By leveraging machine learning techniques, including gradient boosting and automated methods, coupled with novel population-conditional re-sampling techniques, our method significantly improves the phenotypic prediction from single nucleotide polymorphism (SNP) data for diverse populations. We evaluate our approach using the UK Biobank, which is composed primarily of British individuals with European ancestry, and a minority representation of groups with Asian and African ancestry. Performance metrics demonstrate substantial improvements in phenotype prediction for underrepresented groups, achieving prediction accuracy comparable to that of the majority group. This approach represents a significant step towards improving prediction accuracy amidst current dataset diversity challenges. By integrating a tailored pipeline, our approach fosters more equitable validity and utility of statistical genetics methods, paving the way for more inclusive models and outcomes.
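The abstract does not spell out how the population-conditional re-sampling works, so the following is only a minimal sketch of one plausible reading: oversampling individuals with replacement so that each population group contributes equally to the training set. The function name `resample_by_population` and the labels "EUR", "AFR", and "ASN" are hypothetical and are not taken from the paper; the authors' actual scheme may differ.

```python
import random
from collections import defaultdict


def resample_by_population(labels, seed=0):
    """Return training indices balanced across population labels.

    labels: one population label per individual (hypothetical names).
    Every individual is kept once; smaller groups are then oversampled
    with replacement until they match the size of the largest group.
    This is an illustrative assumption, not the paper's procedure.
    """
    rng = random.Random(seed)

    # Group individual indices by their population label.
    by_group = defaultdict(list)
    for idx, label in enumerate(labels):
        by_group[label].append(idx)

    # Size of the largest (majority) group sets the target per group.
    target = max(len(idxs) for idxs in by_group.values())

    balanced = []
    for label in sorted(by_group):
        idxs = by_group[label]
        balanced.extend(idxs)
        # Draw the shortfall with replacement from the same group.
        balanced.extend(rng.choices(idxs, k=target - len(idxs)))

    rng.shuffle(balanced)
    return balanced


if __name__ == "__main__":
    labels = ["EUR"] * 6 + ["AFR"] * 2 + ["ASN"] * 1
    indices = resample_by_population(labels)
    print(len(indices))
```

The returned indices could then be used to select rows of an SNP genotype matrix before fitting a model such as a gradient boosting regressor. Plain oversampling duplicates minority-group individuals, so any held-out evaluation set should be split off before re-sampling.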

Publisher

Cold Spring Harbor Laboratory
