TeraPCA: a fast and scalable software package to study genetic variation in tera-scale genotypes

Published: 2019-04-08 Issue: 19 Volume: 35 Pages: 3679-3683
ISSN: 1367-4803
Container-title: Bioinformatics
Language: en

Author:

Bose Aritra¹, Kalantzis Vassilis², Kontopoulou Eugenia-Maria¹, Elkady Mai¹, Paschou Peristera³, Drineas Petros¹

Affiliation:

1. Computer Science Department, Purdue University, West Lafayette, IN, USA

2. IBM Research, Thomas J. Watson Research Center, Yorktown Heights, NY, USA

3. Department of Biological Sciences, Purdue University, West Lafayette, IN, USA

Abstract

Motivation: Principal Component Analysis is a key tool in the study of population structure in human genetics. As modern datasets grow increasingly large, traditional approaches that load the entire dataset into system memory (Random Access Memory) become impractical, and out-of-core implementations are the only viable alternative.

Results: We present TeraPCA, a C++ implementation of the Randomized Subspace Iteration method to perform Principal Component Analysis of large-scale datasets. TeraPCA can be applied both in-core and out-of-core and operates successfully even on commodity hardware with just a few gigabytes of system memory. Moreover, TeraPCA has minimal dependencies on external libraries and requires only a working installation of the BLAS and LAPACK libraries. When applied to a dataset containing a million individuals genotyped on a million markers, TeraPCA requires <5 h (in multi-threaded mode) to accurately compute the 10 leading principal components. An extensive experimental analysis shows that TeraPCA is both fast and accurate and is competitive with current state-of-the-art software for the same task.

Availability and implementation: Source code and documentation are both available at https://github.com/aritra90/TeraPCA.

Supplementary information: Supplementary data are available at Bioinformatics online.
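
The idea of out-of-core subspace iteration mentioned in the abstract can be illustrated with a minimal sketch. This is not TeraPCA's code or API: the function names (`gram_times`, `subspace_iteration`), the block layout, and the use of plain Gram-Schmidt in pure Python are all assumptions made for illustration, and the input is assumed to be already centered. The sketch starts from a random Gaussian basis, repeatedly multiplies it by A^T A while reading A one row block at a time, and re-orthonormalizes.

```python
import random

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    """Dense product of two matrices stored as lists of rows."""
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def orthonormalize(Q):
    """Modified Gram-Schmidt on the columns of Q (no rank-deficiency handling)."""
    basis = []
    for v in transpose(Q):
        for u in basis:
            dot = sum(x * y for x, y in zip(u, v))
            v = [x - dot * y for x, y in zip(v, u)]
        norm = sum(x * x for x in v) ** 0.5
        basis.append([x / norm for x in v])
    return transpose(basis)

def gram_times(blocks, Q):
    """Compute (A^T A) Q by streaming row blocks of A.

    `blocks` is a hypothetical callable that yields row blocks of A;
    only one block needs to be in memory at any time.
    """
    n, k = len(Q), len(Q[0])
    out = [[0.0] * k for _ in range(n)]
    for block in blocks():
        partial = matmul(transpose(block), matmul(block, Q))
        for i in range(n):
            for j in range(k):
                out[i][j] += partial[i][j]
    return out

def subspace_iteration(blocks, n, k, iters=50, seed=0):
    """Approximate the k leading eigenpairs of A^T A from a random start."""
    rng = random.Random(seed)
    Q = orthonormalize([[rng.gauss(0, 1) for _ in range(k)] for _ in range(n)])
    for _ in range(iters):
        Q = orthonormalize(gram_times(blocks, Q))
    # Rayleigh quotients of the converged columns estimate the eigenvalues.
    Y = gram_times(blocks, Q)
    eigvals = [sum(Q[i][j] * Y[i][j] for i in range(n)) for j in range(k)]
    return eigvals, Q

# Toy data (illustrative only): A^T A = diag(9, 4, 1).
A = [[3.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]

def blocks(size=2):
    for start in range(0, len(A), size):
        yield A[start:start + size]

eigvals, Q = subspace_iteration(blocks, n=3, k=2)
```

On this toy matrix the two returned values approach the leading eigenvalues 9 and 4. A production implementation would delegate the dense products and the orthonormalization to BLAS and LAPACK routines, and would read blocks from disk rather than from an in-memory list.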

Funder

National Science Foundation

Publisher

Oxford University Press (OUP)

Subject

Computational Mathematics,Computational Theory and Mathematics,Computer Science Applications,Molecular Biology,Biochemistry,Statistics and Probability

Link

http://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btz157/28492008/btz157.pdf

References: 31 articles.

1. Fast principal component analysis of large-scale genome-wide data;Abraham;PLoS One,2014

2. FlashPCA2: principal component analysis of Biobank-scale genotype datasets;Abraham;Bioinformatics,2017

3. Fast model-based estimation of ancestry in unrelated individuals;Alexander;Genome Res,2009

4. LAPACK Users' Guide

5. Dissecting Population Substructure in India via Correlation Optimization of Genetics and Geodemographics;Bose;bioRxiv,2017

Cited by 30 articles.

1. MaSk-LMM: A Matrix Sketching Framework for Linear Mixed Models in Association Studies;Lecture Notes in Computer Science;2024

2. MaSk-LMM: A Matrix Sketching Framework for Linear Mixed Models in Association Studies;2023-11-13

3. Structure-informed clustering for population stratification in association studies;BMC Bioinformatics;2023-10-31

4. PheWAS and cross-disorder analysis reveal genetic architecture, pleiotropic loci and phenotypic correlations across 11 autoimmune disorders;Frontiers in Immunology;2023-09-21

5. Fast and accurate out-of-core PCA framework for large scale biobank data;Genome Research;2023-08-24