mbkmeans: Fast clustering for single cell data using mini-batch k-means
-
Published:2021-01-26
Issue:1
Volume:17
Page:e1008625
-
ISSN:1553-7358
-
Container-title:PLOS Computational Biology
-
language:en
-
Short-container-title:PLoS Comput Biol
Author:
Hicks Stephanie C.,
Liu Ruoxi,
Ni Yuwei,
Purdom Elizabeth,
Risso Davide
Abstract
Single-cell RNA-Sequencing (scRNA-seq) is the most widely used high-throughput technology to measure genome-wide gene expression at the single-cell level. One of the most common analyses of scRNA-seq data detects distinct subpopulations of cells through the use of unsupervised clustering algorithms. However, recent advances in scRNA-seq technologies result in current datasets ranging from thousands to millions of cells. Popular clustering algorithms, such as k-means, typically require the data to be loaded entirely into memory and therefore can be slow or impossible to run with large datasets. To address this problem, we developed the mbkmeans R/Bioconductor package, an open-source implementation of the mini-batch k-means algorithm. Our package allows for on-disk data representations, such as the common HDF5 file format widely used for single-cell data, that do not require all the data to be loaded into memory at one time. We demonstrate the performance of the mbkmeans package using large datasets, including one with 1.3 million cells. We also highlight and compare the computing performance of mbkmeans against the standard implementation of k-means and other popular single-cell clustering methods. Our software package is available in Bioconductor at https://bioconductor.org/packages/mbkmeans.
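The mini-batch idea named in the abstract can be illustrated with a minimal sketch. This is not the mbkmeans package's API or implementation. It is a plain-Python illustration of the commonly used mini-batch update, in which each center moves toward every point assigned to it with a step of 1/(number of points that center has seen so far). The function names (`nearest_center`, `mini_batch_kmeans`), the `batches` iterable, and the toy data are all hypothetical and chosen for illustration. Passing batches as an iterable mimics reading chunks from an on-disk file, so only one batch needs to be in memory at a time.

```python
import random
from typing import Iterable, List, Sequence

Point = Sequence[float]


def nearest_center(point: Point, centers: List[List[float]]) -> int:
    """Return the index of the center closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda j: sum((x - c) ** 2 for x, c in zip(point, centers[j])),
    )


def mini_batch_kmeans(
    batches: Iterable[Sequence[Point]],
    init_centers: Sequence[Point],
) -> List[List[float]]:
    """Illustrative mini-batch k-means (hypothetical helper, not the mbkmeans API).

    `batches` yields small chunks of points, e.g. read one at a time from disk.
    `init_centers` supplies the k starting centers.
    """
    centers = [list(c) for c in init_centers]
    counts = [0] * len(centers)  # points seen so far by each center

    for batch in batches:
        # Assign every point in the batch using the centers as they stand
        # at the start of the batch.
        assignments = [nearest_center(p, centers) for p in batch]

        # Move each center toward its assigned points with a per-center
        # step size that shrinks as the center sees more points.
        for p, j in zip(batch, assignments):
            counts[j] += 1
            step = 1.0 / counts[j]
            centers[j] = [(1.0 - step) * c + step * x for c, x in zip(centers[j], p)]

    return centers


if __name__ == "__main__":
    rng = random.Random(0)

    def toy_batches(n_batches: int, batch_size: int):
        """Yield batches drawn from two synthetic groups (assumed toy data)."""
        for _ in range(n_batches):
            batch = []
            for _ in range(batch_size):
                mu = rng.choice((0.0, 10.0))
                batch.append((rng.gauss(mu, 1.0), rng.gauss(mu, 1.0)))
            yield batch

    print(mini_batch_kmeans(toy_batches(50, 20), [(1.0, 1.0), (9.0, 9.0)]))
```

Because the centers are updated from one batch at a time, memory use depends on the batch size rather than on the total number of cells, which is the property the abstract relies on for on-disk formats such as HDF5.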
Funder
National Institutes of Health
Chan Zuckerberg Initiative DAF
ENS-CFM Data Science Chair
Ministero dell’Istruzione, dell’Università e della Ricerca
Publisher
Public Library of Science (PLoS)
Subject
Computational Theory and Mathematics; Cellular and Molecular Neuroscience; Genetics; Molecular Biology; Ecology; Modelling and Simulation; Ecology, Evolution, Behavior and Systematics
Cited by
44 articles.