Abstract
Single-cell RNA sequencing (scRNA-seq) has revolutionized our understanding of cellular heterogeneity and tissue transcriptomic complexity. However, the high frequency of dropout events in scRNA-seq data complicates downstream analyses such as cell type identification and trajectory inference. Existing imputation methods address the dropout problem but face limitations such as high computational cost and risk of over-imputation. We present SmartImpute, a novel computational framework designed for targeted imputation of scRNA-seq data. SmartImpute focuses on a predefined set of marker genes, enhancing the biological relevance and computational efficiency of the imputation process while minimizing the risk of model misspecification. Using a modified Generative Adversarial Imputation Network architecture, SmartImpute accurately imputes missing gene expression values and distinguishes between true biological zeros and missing values, preventing overfitting and preserving biologically relevant zeros. To ensure reproducibility, we also provide a function based on the GPT-4 model that creates target gene panels according to tissue type and research context. Our results, based on scRNA-seq data from head and neck squamous cell carcinoma and human bone marrow, demonstrate that SmartImpute significantly enhances cell type annotation and clustering accuracy while reducing computational burden. Benchmarking against other imputation methods highlights SmartImpute’s superior performance in terms of both accuracy and efficiency. Overall, SmartImpute provides a lightweight, efficient, and biologically relevant solution for addressing dropout events in scRNA-seq data, facilitating deeper insights into cellular heterogeneity and disease progression. Furthermore, SmartImpute’s targeted approach can be extended to spatial omics data, which also contain many missing values.
Publisher
Cold Spring Harbor Laboratory