minicore: Fast scRNA-seq clustering with various distances

Published: 2021-03-25

Authors:

Baker, Daniel N.; Dyjack, Nathan; Braverman, Vladimir; Hicks, Stephanie C.; Langmead, Ben

Abstract

Single-cell RNA-sequencing (scRNA-seq) analyses typically begin by clustering a gene-by-cell expression matrix to empirically define groups of cells with similar expression profiles. We describe new methods and a new open source library, minicore, for efficient k-means++ center finding and k-means clustering of scRNA-seq data. Minicore works with sparse count data, as it emerges from typical scRNA-seq experiments, as well as with dense data produced by dimensionality reduction. Minicore's novel vectorized weighted reservoir sampling algorithm allows it to find initial k-means++ centers for a 4-million cell dataset in 1.5 minutes using 20 threads. Minicore can cluster using Euclidean distance, but also supports a wider class of measures like Jensen-Shannon Divergence, Kullback-Leibler Divergence, and the Bhattacharyya distance, which can be directly applied to count data and probability distributions. Further, minicore produces lower-cost centerings more efficiently than scikit-learn for scRNA-seq datasets with millions of cells. With careful handling of priors, minicore implements these distance measures with only minor (<2-fold) speed differences among all distances. We show that a minicore pipeline consisting of k-means++, localsearch++ and minibatch k-means can cluster a 4-million cell dataset in minutes, using less than 10 GiB of RAM. This memory efficiency enables atlas-scale clustering on laptops and other commodity hardware. Finally, we report findings on which distance measures give clusterings that are most consistent with known cell type labels.

Availability: The open source library is at https://github.com/dnbaker/minicore. Code used for experiments is at https://github.com/dnbaker/minicore-experiments.
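To make the ideas in the abstract concrete, the following is a minimal pure-Python sketch of k-means++ seeding with a pluggable dissimilarity. It is not minicore's API or implementation. All function names (`normalize`, `jsd`, `weighted_pick`, `kmeanspp_seed`) and the prior value are hypothetical choices made for illustration. The sampler uses the generic one-pass exponential-key scheme of Efraimidis and Spirakis as a stand-in for weighted reservoir sampling; whether minicore's vectorized sampler works this way is an assumption.

```python
import math
import random


def normalize(counts, prior=1e-2):
    """Convert a count vector to probabilities, adding a small prior to each entry.

    The prior value is an illustrative assumption, not a minicore default.
    """
    total = sum(counts) + prior * len(counts)
    return [(c + prior) / total for c in counts]


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def weighted_pick(weights, rng):
    """Pick index i with probability weights[i] / sum(weights) in one pass.

    Each positive weight gets the key log(u) / w with u uniform in (0, 1];
    the largest key wins. Returns -1 if no weight is positive.
    """
    best_i, best_key = -1, -math.inf
    for i, w in enumerate(weights):
        if w <= 0:
            continue
        u = 1.0 - rng.random()  # random() is in [0, 1), so u is in (0, 1]
        key = math.log(u) / w
        if key > best_key:
            best_i, best_key = i, key
    return best_i


def kmeanspp_seed(points, k, dist, seed=0):
    """Choose up to k center indices by k-means++-style sampling.

    Each new center is sampled with probability proportional to the current
    cost of a point, i.e. its dissimilarity to the nearest chosen center.
    """
    rng = random.Random(seed)
    centers = [rng.randrange(len(points))]
    costs = [dist(p, points[centers[0]]) for p in points]
    while len(centers) < k:
        nxt = weighted_pick(costs, rng)
        if nxt < 0:  # every point coincides with an existing center
            break
        centers.append(nxt)
        costs = [min(c, dist(p, points[nxt])) for c, p in zip(costs, points)]
    return centers, sum(costs)


# Toy count matrix: 6 cells by 3 genes (made-up numbers).
counts = [[10, 0, 1], [9, 1, 0], [0, 8, 2], [1, 9, 1], [0, 1, 12], [1, 0, 10]]
probs = [normalize(c) for c in counts]
centers, cost = kmeanspp_seed(probs, k=3, dist=jsd)
```

Passing a different function as `dist`, for example squared Euclidean distance on the raw or reduced vectors, leaves the seeding loop unchanged, which is the sense in which the sampling step is independent of the chosen measure.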

Publisher

Cold Spring Harbor Laboratory

References (32 articles)

1. Arthur, D., Vassilvitskii, S.: K-means++: The advantages of careful seeding. SODA, pp. 1027–1035 (2007)

2. Baker, D.: libsimdsampling. http://github.com/dnbaker/libsimdsampling (2008), [Online; accessed 7 Feb, 2021]

3. Distributed k-means and k-median clustering on general topologies. Advances in Neural Information Processing Systems (2013)

4. Clustering with Bregman divergences. Journal of Machine Learning Research (2005)