Authors:
Chen Dieyi, Jin Jiashun, Ke Zheng Tracy
Abstract
Subject clustering (i.e., the use of measured features to cluster subjects, such as patients or cells, into multiple groups) is a problem of significant interest. In recent years, many approaches have been proposed, among which unsupervised deep learning (UDL) has received much attention. Two interesting questions are 1) how to combine the strengths of UDL and other approaches and 2) how these approaches compare to each other. We combine the variational auto-encoder (VAE), a popular UDL approach, with the recent idea of influential feature-principal component analysis (IF-PCA) and propose IF-VAE as a new method for subject clustering. We study IF-VAE and compare it with several other methods (including IF-PCA, VAE, Seurat, and SC3) on 10 gene microarray data sets and eight single-cell RNA-seq data sets. We find that IF-VAE shows significant improvement over VAE but still underperforms IF-PCA. We also find that IF-PCA is quite competitive, slightly outperforming Seurat and SC3 on the eight single-cell data sets. IF-PCA is conceptually simple and permits delicate analysis. We demonstrate that IF-PCA is capable of achieving the phase transition in a rare/weak model. By comparison, Seurat and SC3 are more complex and theoretically difficult to analyze; for these reasons, their optimality remains unclear.
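The abstract names IF-PCA without describing its steps. As a rough illustration only, the sketch below follows the commonly described outline of the method: screen features with a Kolmogorov-Smirnov (KS) score, run PCA on the retained features, then apply k-means. It is a minimal sketch under stated assumptions, not the authors' implementation. The function name `if_pca_sketch`, the fixed-size selection rule `n_keep` (standing in for a data-driven threshold), and the synthetic data are all hypothetical.

```python
import numpy as np
from scipy.stats import kstest
from scipy.cluster.vq import kmeans2


def if_pca_sketch(X, n_clusters, n_keep, seed=0):
    """Cluster the rows of X (subjects x features). Hypothetical helper."""
    # Step 1: standardize each feature to mean 0 and variance 1
    # (assumes no feature is constant across subjects).
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: score each feature by its KS distance from N(0, 1);
    # a feature that separates the groups tends to look non-Gaussian.
    ks = np.array([kstest(Z[:, j], "norm").statistic for j in range(Z.shape[1])])
    # Assumption: keep a fixed number of top-scoring features instead of
    # choosing the cutoff from the data.
    keep = np.argsort(ks)[-n_keep:]
    # Step 3: PCA on the selected features, then k-means on the leading
    # n_clusters - 1 left singular vectors.
    U, _, _ = np.linalg.svd(Z[:, keep], full_matrices=False)
    _, labels = kmeans2(U[:, : n_clusters - 1], n_clusters, minit="++", seed=seed)
    return labels, keep


# Synthetic example (not data from the paper): 300 subjects, 500 features,
# of which only the first 20 differ between the two groups.
rng = np.random.default_rng(0)
truth = rng.integers(0, 2, size=300)
X = rng.normal(size=(300, 500))
X[:, :20] += 6.0 * truth[:, None]

labels, keep = if_pca_sketch(X, n_clusters=2, n_keep=20)
# Cluster labels are arbitrary, so score agreement up to label swapping.
agreement = max(np.mean(labels == truth), np.mean(labels != truth))
print(f"agreement with true groups: {agreement:.2f}")
```

The group shift in this toy example is deliberately large so that the screening step is easy; it does not represent the rare/weak regime discussed in the abstract, where useful features are few and individually faint.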
Subject
Genetics (clinical), Genetics, Molecular Medicine