Compression of quantification uncertainty for scRNA-seq counts

Published:2021-01-20 Issue:12 Volume:37 Page:1699-1707
ISSN:1367-4803
Container-title:Bioinformatics
language:en
Author:

Van Buren Scott¹, Sarkar Hirak²³, Srivastava Avi⁴⁵, Rashid Naim U¹⁶, Patro Rob²³, Love Michael I¹⁷

Affiliation:

1. Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27516, USA

2. Department of Computer Science, University of Maryland, College Park, MD 20742, USA

3. Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA

4. New York Genome Center, New York, NY 10013, USA

5. Center for Genomics and Systems Biology, New York University, New York, NY 10003, USA

6. Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA

7. Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27514, USA

Abstract

Motivation: Quantification estimates of gene expression from single-cell RNA-seq (scRNA-seq) data have inherent uncertainty due to reads that map to multiple genes. Many existing scRNA-seq quantification pipelines ignore multi-mapping reads and therefore underestimate expected read counts for many genes. alevin accounts for multi-mapping reads and allows for the generation of ‘inferential replicates’, which reflect quantification uncertainty. Previous methods have shown improved performance when incorporating these replicates into statistical analyses, but storage and use of these replicates increases computation time and memory requirements.

Results: We demonstrate that storing only the mean and variance from a set of inferential replicates (‘compression’) is sufficient to capture gene-level quantification uncertainty, while reducing disk storage to as low as 9% of original storage, and memory usage when loading data to as low as 6%. Using these values, we generate ‘pseudo-inferential’ replicates from a negative binomial distribution and propose a general procedure for incorporating these replicates into a proposed statistical testing framework. When applying this procedure to trajectory-based differential expression analyses, we show false positives are reduced by more than a third for genes with high levels of quantification uncertainty. We additionally extend the Swish method to incorporate pseudo-inferential replicates and demonstrate improvements in computation time and memory usage without any loss in performance. Lastly, we show that discarding multi-mapping reads can result in significant underestimation of counts for functionally important genes in a real dataset.

Availability and implementation: makeInfReps and splitSwish are implemented in the R/Bioconductor fishpond package available at https://bioconductor.org/packages/fishpond. Analyses and simulated datasets can be found in the paper’s GitHub repo at https://github.com/skvanburen/scUncertaintyPaperCode.

Supplementary information: Supplementary data are available at Bioinformatics online.
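The compression idea summarized in the abstract (keep only a mean and a variance per gene and cell, then regenerate replicates from a negative binomial distribution) can be illustrated with a minimal Python sketch. This is not the fishpond implementation: the function names `compress`, `draw_poisson` and `pseudo_inf_reps` are hypothetical, the negative binomial is parameterized by the method of moments and sampled as a gamma-Poisson mixture, and the Poisson fallback when the variance does not exceed the mean is an assumption of this sketch.

```python
import math
import random
import statistics


def compress(inf_reps):
    """Reduce one gene's inferential replicates to (mean, sample variance)."""
    return statistics.fmean(inf_reps), statistics.variance(inf_reps)


def draw_poisson(lam, rng):
    """Draw from Poisson(lam) with Knuth's method.

    Large rates are split into chunks of at most 30 so that exp(-chunk)
    does not underflow; sums of independent Poisson draws are Poisson.
    """
    total = 0
    while lam > 0:
        chunk = min(lam, 30.0)
        lam -= chunk
        limit = math.exp(-chunk)
        prod = rng.random()
        while prod > limit:
            prod *= rng.random()
            total += 1
    return total


def pseudo_inf_reps(mean, variance, n_reps, rng):
    """Draw n_reps counts whose distribution has the given mean and variance."""
    if mean <= 0:
        return [0] * n_reps
    if variance <= mean:
        # No overdispersion to model: fall back to Poisson (assumption of this sketch).
        return [draw_poisson(mean, rng) for _ in range(n_reps)]
    # Method of moments: variance = mean + mean^2 / size.
    size = mean * mean / (variance - mean)
    scale = mean / size
    # Gamma-Poisson mixture: a gamma-distributed rate followed by a Poisson draw
    # yields a negative binomial count.
    return [draw_poisson(rng.gammavariate(size, scale), rng) for _ in range(n_reps)]
```

Calling `compress` on a list of replicate counts and passing the result to `pseudo_inf_reps` with a seeded `random.Random` gives reproducible pseudo-replicates. Only two numbers per gene and cell need to be stored, which is where the storage saving comes from.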

Funder

National Institutes of Health

National Science Foundation

Publisher

Oxford University Press (OUP)

Subject

Computational Mathematics,Computational Theory and Mathematics,Computer Science Applications,Molecular Biology,Biochemistry,Statistics and Probability

Link

http://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btab001/36158262/btab001.pdf

References (50 articles)

1. Transcription-mediated gene fusion in the human genome;Akiva;Genome Res,2005

2. The ndpk/nme superfamily: state of the art;Boissan;Lab. Investig,2018

3. Near-optimal probabilistic RNA-seq quantification;Bray;Nat. Biotechnol,2016

4. Computational methods for trajectory inference from single-cell transcriptomics;Cannoodt;Eur. J. Immunol,2016

Cited by 4 articles.

1. Comprehensive analysis of genetic associations and single-cell expression profiles reveals potential links between migraine and multiple diseases: a phenome-wide association study;Frontiers in Neurology;2024-02-07

2. satuRn: Scalable analysis of differential transcript usage for bulk and single-cell RNA-sequencing applications;F1000Research;2022-08-08

3. satuRn: Scalable analysis of differential transcript usage for bulk and single-cell RNA-sequencing applications;F1000Research;2021-05-11

4. STARsolo: accurate, fast and versatile mapping/quantification of single-cell and single-nucleus RNA-seq data;2021-05-05