Data Harmonization Guidelines to Combine Multi‐platform Genomic Data from Admixed Populations and Boost Power in Genome‐Wide Association Studies

Author:

Croock Dayna (1), Swart Yolandi (1), Schurz Haiko (1), Petersen Desiree C. (1), Möller Marlo (1, 2), Uren Caitlin (1, 2)

Affiliation:

1. DSI‐NRF Centre of Excellence for Biomedical Tuberculosis Research, South African Medical Research Council Centre for Tuberculosis Research, Division of Molecular Biology and Human Genetics, Faculty of Medicine and Health Sciences, Stellenbosch University, Stellenbosch, South Africa

2. Centre for Bioinformatics and Computational Biology, Stellenbosch University, Stellenbosch, South Africa

Abstract

Data harmonization involves combining data from multiple independent sources and processing the data to produce one uniform dataset. Merging separate genotypes or whole‐genome sequencing datasets has been proposed as a strategy to increase the statistical power of association tests by increasing the effective sample size. However, data harmonization is not a widely adopted strategy due to the difficulties with merging data (including confounding produced by batch effects and population stratification). Detailed data harmonization protocols are scarce and are often conflicting. Moreover, data harmonization protocols that accommodate samples of admixed ancestry are practically non‐existent. Existing data harmonization procedures must be modified to ensure the heterogeneous ancestry of admixed individuals is incorporated into additional downstream analyses without confounding results. Here, we propose a set of guidelines for merging multi‐platform genetic data from admixed samples that can be adopted by any investigator with elementary bioinformatics experience. We have applied these guidelines to aggregate 1544 tuberculosis (TB) case‐control samples from six separate in‐house datasets and conducted a genome‐wide association study (GWAS) of TB susceptibility. The GWAS performed on the merged dataset had improved power over analyzing the datasets individually and produced summary statistics free from bias introduced by batch effects and population stratification. © 2024 The Author(s). Current Protocols published by Wiley Periodicals LLC.

Basic Protocol 1: Processing separate datasets comprising array genotype data

Alternate Protocol 1: Processing separate datasets comprising array genotype and whole‐genome sequencing data

Alternate Protocol 2: Performing imputation using a local reference panel

Basic Protocol 2: Merging separate datasets

Basic Protocol 3: Ancestry inference using ADMIXTURE and RFMix

Basic Protocol 4: Batch effect correction using pseudo‐case‐control comparisons

Publisher

Wiley
