Inference after latent variable estimation for single-cell RNA sequencing data

Author:

Anna Neufeld (1), Lucy L. Gao (2), Joshua Popp (3), Alexis Battle (4), Daniela Witten (5)

Affiliation:

1. Department of Statistics, University of Washington, Seattle, WA 98195, USA

2. Department of Statistics, University of British Columbia, BC V6T 1Z4, Canada

3. Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA

4. Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA and Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA

5. Department of Statistics, University of Washington, Seattle, WA 98195, USA and Department of Biostatistics, University of Washington, Seattle, WA 98195, USA

Abstract

In the analysis of single-cell RNA sequencing data, researchers often characterize the variation between cells by estimating a latent variable, such as cell type or pseudotime, representing some aspect of the cell’s state. They then test each gene for association with the estimated latent variable. If the same data are used for both of these steps, then standard methods for computing p-values in the second step will fail to achieve statistical guarantees such as Type 1 error control. Furthermore, approaches such as sample splitting that can be applied to solve similar problems in other settings are not applicable in this context. In this article, we introduce count splitting, a flexible framework that allows us to carry out valid inference in this setting, for virtually any latent variable estimation technique and inference approach, under a Poisson assumption. We demonstrate the Type 1 error control and power of count splitting in a simulation study and apply count splitting to a data set of pluripotent stem cells differentiating to cardiomyocytes.
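The abstract does not spell out the mechanics of count splitting. The sketch below illustrates one way the idea can be realized, assuming it refers to binomial thinning of Poisson counts: if X ~ Poisson(lambda) and X_train | X ~ Binomial(X, eps), then X_train and X_test = X - X_train are independent, with X_train ~ Poisson(eps * lambda) and X_test ~ Poisson((1 - eps) * lambda). The function name `count_split`, the parameter `eps`, and the toy matrix are illustrative assumptions, not the authors' software.

```python
import random


def count_split(counts, eps=0.5, seed=0):
    """Split a cells-by-genes matrix of counts into train and test matrices.

    Assumption: each entry is thinned independently with a Binomial(x, eps)
    draw, which yields independent train and test parts when the original
    counts are Poisson. This is a sketch, not the authors' implementation.
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    rng = random.Random(seed)
    train, test = [], []
    for row in counts:
        # Binomial(x, eps) drawn as a sum of x Bernoulli(eps) trials.
        train_row = [sum(rng.random() < eps for _ in range(x)) for x in row]
        train.append(train_row)
        # The test part is whatever was not assigned to the train part.
        test.append([x - t for x, t in zip(row, train_row)])
    return train, test


# Hypothetical toy data: 3 cells (rows) by 3 genes (columns).
counts = [[3, 0, 7], [1, 12, 0], [5, 2, 9]]
train, test = count_split(counts, eps=0.5, seed=1)
```

Under this reading, the latent variable (for example, clusters or pseudotime) would be estimated from `train` only, and each gene would then be tested for association with that estimate using `test` only, so the two steps never touch the same information.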

Funder

Simons Foundation

Publisher

Oxford University Press (OUP)

Subject

Statistics, Probability and Uncertainty; General Medicine; Statistics and Probability

