A comparison between similarity matrices for principal component analysis to assess population stratification in sequenced genetic data sets

Published:2022-12-30 Issue:1 Volume:24 Page:
ISSN:1467-5463
Container-title:Briefings in Bioinformatics
language:en

Author:

Lee Sanghun¹,²,³,⁴, Hahn Georg¹, Hecker Julian²,⁵, Lutz Sharon M¹,⁵,⁶, Mullin Kristina⁷, Hide Winston⁵,⁸, Bertram Lars⁹,¹⁰, DeMeo Dawn L²,⁵, Tanzi Rudolph E⁵,⁷, Lange Christoph¹,², Prokopenko Dmitry⁵,⁷

Affiliation:

1. Harvard University Department of Biostatistics, T.H. Chan School of Public Health, Boston, MA, USA

2. Brigham and Women’s Hospital Channing Division of Network Medicine, Boston, MA, USA

3. Dankook University Department of Medical Consilience, Division of Medicine, Graduate School, South Korea

4. NH Institute for Natural Product Research, Myungji Hospital, South Korea

5. Harvard Medical School, Boston, MA, USA

6. Harvard Pilgrim Health Care Institute Department of Population Medicine, Boston, MA, USA

7. Massachusetts General Hospital Genetics and Aging Unit and McCance Center for Brain Health, Department of Neurology, Boston, MA, USA

8. Beth Israel Deaconess Medical Center Department of Pathology, Boston, MA, USA

9. University of Lübeck Lübeck Interdisciplinary Platform for Genome Analytics, Lübeck, Germany

10. University of Oslo Department of Psychology, Oslo, Norway

Abstract

Genetic similarity matrices are commonly used to assess population substructure (PS) in genetic studies. Through simulation studies and application to whole-genome sequencing (WGS) data, we evaluate the performance of three genetic similarity matrices: the unweighted and weighted Jaccard similarity matrices and the genetic relationship matrix. We describe scenarios that can create numerical pitfalls and, in some instances, lead to incorrect conclusions. We consider scenarios in which PS is assessed based on loci located across the genome (‘globally’) and on loci from a specific genomic region (‘locally’). We also compare scenarios in which PS is evaluated based on single-nucleotide variations (SNVs) from different minor allele frequency bins: common (>5%), low-frequency (0.5–5%) and rare (<0.5%). Overall, we observe that all approaches provide the best clustering performance when computed based on rare SNVs. The performance of the similarity matrices is very similar for common and low-frequency variants, but for rare variants, the unweighted Jaccard matrix provides preferable clustering features. Based on visual inspection and standard clustering metrics, its clusters in the principal component analysis of rare SNVs are the densest and best separated compared with the other methods and allele frequency cutoffs. In an application, we assessed the role of rare variants in local and global PS, using WGS data from multiethnic Alzheimer’s disease data sets and European or East Asian populations from the 1000 Genomes Project.
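
To make the compared quantities concrete, the following is a minimal sketch of two of the three matrices, assuming common textbook definitions rather than the authors' exact implementation. The abstract gives no formulas, so the function names (`jaccard_similarity`, `grm`, `top_components`), the 0/1/2 genotype coding, and the carrier-based definition of the unweighted Jaccard index are all assumptions. The weighted Jaccard variant is omitted because its weighting scheme is not described here.

```python
import numpy as np


def jaccard_similarity(genotypes):
    """Unweighted Jaccard similarity between individuals (assumed definition).

    genotypes: (n_individuals, n_variants) array of minor-allele counts (0/1/2).
    Entry (i, j) is |A_i & A_j| / |A_i | A_j|, where A_i is the set of
    variants at which individual i carries at least one minor allele.
    """
    carrier = (np.asarray(genotypes) > 0).astype(float)
    shared = carrier @ carrier.T                      # size of the intersection
    counts = carrier.sum(axis=1)
    union = counts[:, None] + counts[None, :] - shared
    # Pairs with an empty union (no carried variants at all) are set to 0.
    return np.divide(shared, union, out=np.zeros_like(shared), where=union > 0)


def grm(genotypes):
    """Genetic relationship matrix from standardized genotypes (assumed form)."""
    g = np.asarray(genotypes, dtype=float)
    freq = g.mean(axis=0) / 2.0                       # sample allele frequency
    keep = (freq > 0) & (freq < 1)                    # drop monomorphic variants
    g, freq = g[:, keep], freq[keep]
    z = (g - 2.0 * freq) / np.sqrt(2.0 * freq * (1.0 - freq))
    return z @ z.T / z.shape[1]


def top_components(similarity, k=2):
    """Leading k eigenvectors of a symmetric similarity matrix."""
    vals, vecs = np.linalg.eigh(similarity)
    order = np.argsort(vals)[::-1][:k]                # largest eigenvalues first
    return vecs[:, order]


if __name__ == "__main__":
    # Toy data for illustration only: 3 individuals, 4 variants.
    g = np.array([[1, 0, 0, 1],
                  [1, 0, 0, 0],
                  [0, 1, 1, 0]])
    print(jaccard_similarity(g))
    print(top_components(grm(g), k=2))
```

In this sketch, restricting the columns of `genotypes` to one minor allele frequency bin or to one genomic region before calling the functions would correspond to the frequency-specific and ‘local’ scenarios described above.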

Funder

National Institute of Mental Health

National Heart, Lung, and Blood Institute

National Human Genome Research Institute

Publisher

Oxford University Press (OUP)

Subject

Molecular Biology, Information Systems

Link

https://academic.oup.com/bib/article-pdf/24/1/bbac611/48782902/bbac611.pdf

Reference: 38 articles.

1. Demonstrating stratification in a European American population;Campbell;Nat Genet,2005