Abstract
Principal component analysis (PCA) is a widely used dimensionality reduction technique in machine learning and multivariate statistics. To improve the interpretability of PCA, various approaches for obtaining sparse principal direction loadings have been proposed, collectively termed Sparse Principal Component Analysis (SPCA). In this paper, we present ThreSPCA, a provably accurate algorithm for the SPCA problem that is based on thresholding the Singular Value Decomposition and imposes no restrictive assumptions on the input covariance matrix. Our thresholding algorithm is conceptually simple, much faster than the current state of the art, and performs well in practice. When applied to genotype data from the 1000 Genomes Project, ThreSPCA is faster than previous benchmarks, at least as accurate, and leads to a set of interpretable biomarkers, revealing genetic diversity across the world.
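To make the idea of thresholding concrete, the following is a minimal sketch in plain Python. It is not the ThreSPCA algorithm as published, whose details the abstract does not give. It assumes the simplest variant: compute the leading eigenvector of a covariance matrix, keep the k largest-magnitude loadings, zero the rest, and renormalize. The function names, the use of power iteration, and the parameter k are all illustrative assumptions.

```python
import math


def top_eigenvector(A, iters=500):
    """Approximate the leading eigenvector of a symmetric PSD matrix A
    (list of lists) by power iteration.

    Assumption: the uniform start vector is not orthogonal to the
    leading eigenvector; otherwise power iteration will not find it.
    """
    n = len(A)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # A maps v to zero; keep the last iterate
        v = [x / norm for x in w]
    return v


def threshold_sparse_pc(A, k):
    """Hypothetical thresholding step: keep the k largest-magnitude
    loadings of the leading eigenvector, zero the others, and rescale
    the result to unit length."""
    v = top_eigenvector(A)
    keep = set(sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)[:k])
    x = [v[i] if i in keep else 0.0 for i in range(len(v))]
    norm = math.sqrt(sum(t * t for t in x))
    return [t / norm for t in x]


def explained_variance(A, x):
    """Quadratic form x^T A x, the variance captured by direction x."""
    n = len(A)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))


if __name__ == "__main__":
    # Toy covariance matrix, invented for illustration only.
    A = [[4.0, 2.0, 0.0],
         [2.0, 3.0, 0.0],
         [0.0, 0.0, 1.0]]
    x = threshold_sparse_pc(A, k=2)
    print(x, explained_variance(A, x))
```

The sparse vector has at most k nonzero loadings, so each retained coordinate can be read as a selected feature. Comparing `explained_variance` for the sparse and dense directions shows how much variance is traded for that interpretability.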
Publisher
Cold Spring Harbor Laboratory