A Systematic Evaluation of Supervised Machine Learning Algorithms for Cell Phenotype Classification Using Single-Cell RNA Sequencing Data

Published: 2022-02-23 Volume: 13
ISSN: 1664-8021
Container-title: Frontiers in Genetics
Short-container-title: Front. Genet.

Author:

Cao Xiaowen, Xing Li, Majd Elham, He Hua, Gu Junhua, Zhang Xuekui

Abstract

The new technology of single-cell RNA sequencing (scRNA-seq) can yield valuable insights into gene expression and give critical information about the cellular compositions of complex tissues. In recent years, vast numbers of scRNA-seq datasets have been generated and made publicly available, and this has enabled researchers to train supervised machine learning models for predicting or classifying various cell-level phenotypes. This has led to the development of many new methods for analyzing scRNA-seq data. Despite the popularity of such applications, there has as yet been no systematic investigation of the performance of these supervised algorithms using predictors from various sizes of scRNA-seq datasets. In this study, 13 popular supervised machine learning algorithms for cell phenotype classification were evaluated using published real and simulated datasets with diverse cell sizes. This benchmark comprises two parts. In the first, real datasets were used to assess the computing speed and cell phenotype classification performance of popular supervised algorithms. The classification performances were evaluated using the area under the receiver operating characteristic curve, F1-score, Precision, Recall, and false-positive rate. In the second part, we evaluated gene-selection performance using published simulated datasets with a known list of real genes. The results showed that ElasticNet with interactions performed the best for small and medium-sized datasets. The NaiveBayes classifier was found to be another appropriate method for medium-sized datasets. With large datasets, the performance of the XGBoost algorithm was found to be excellent. Ensemble algorithms were not found to be significantly superior to individual machine learning methods. Including interactions in the ElasticNet algorithm caused a significant performance improvement for small datasets. The linear discriminant analysis algorithm was found to be the best choice when speed is critical; it is the fastest method, it can scale to handle large sample sizes, and its performance is not much worse than the top performers.
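
The evaluation metrics named in the abstract (area under the ROC curve, F1-score, Precision, Recall, and false-positive rate) can be illustrated with a minimal sketch for the binary case. This is not the authors' code: the function names `binary_metrics` and `roc_auc` and the toy labels are hypothetical, and the study presumably relied on library implementations rather than hand-written ones.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, F1 and false-positive rate from 0/1 labels.

    Hypothetical helper for illustration; label 1 is the positive phenotype.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    # Guard every ratio against an empty denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive cell outscores a
    random negative cell (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (made up for illustration, not taken from the paper).
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
scores = [0.9, 0.4, 0.2, 0.6, 0.8, 0.1]

metrics = binary_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

The pairwise AUC computation is quadratic in the number of cells, so it is only suitable for small examples; for large datasets a rank-based implementation from an established library would be the practical choice.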

Funder

Natural Sciences and Engineering Research Council of Canada

Publisher

Frontiers Media SA

Subject

Genetics (clinical), Genetics, Molecular Medicine

Reference 44 articles.

1. A Comparison of Automatic Cell Identification Methods for Single-Cell RNA-Sequencing Data;Abdelaal;Genome Biol.,2019

2. ScPred: Accurate Supervised Method for Cell-Type Classification from Single-Cell RNA-Seq Data;Alquicira-Hernandez;Genome Biol.,2019

3. Method of the Year 2013;Editorial;Nat. Methods,2014

4. Annotating Cell Types in Human Single-Cell RNA-Seq Data with CellO;Bernstein;STAR Protoc.,2021

5. scID Uses Discriminant Analysis to Identify Transcriptionally Equivalent Cell Types across Single-Cell RNA-Seq Data with Batch Effect;Boufea;iScience,2020

Cited by 11 articles.

1. Utilizing Multi-Class Classification Methods for Automated Sleep Disorder Prediction;Information;2024-07-23

2. Development of a multigenomic liquid biopsy (PROSTest) for prostate cancer in whole blood;The Prostate;2024-04-03

3. Quality adjustment and analysis of human resource prices in China: Based on a hedonic price model;PLOS ONE;2024-04-02

4. Essential elements of physical fitness analysis in male adolescent athletes using machine learning;PLOS ONE;2024-04-02

5. Investigating the overlap of machine learning algorithms in the final results of RNA-seq analysis on gene expression estimation;Health Information Science and Systems;2024-02-29