Application and Comparison of Machine Learning and Database-Based Methods in Taxonomic Classification of High-Throughput Sequencing Data

Published: 2024-05 Issue: 5 Volume: 16
ISSN: 1759-6653
Container-title: Genome Biology and Evolution
Language: en

Author:

Tian Qinzhong¹², Zhang Pinglu¹², Zhai Yixiao¹², Wang Yansu¹², Zou Quan¹²

Affiliation:

1. Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China

2. Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003, China

Abstract

The advent of high-throughput sequencing technologies has not only revolutionized the field of bioinformatics but has also heightened the demand for efficient taxonomic classification. Despite technological advancements, efficiently processing and analyzing the deluge of sequencing data for precise taxonomic classification remains a formidable challenge. Existing classification approaches primarily fall into two categories, database-based methods and machine learning methods, each presenting its own set of challenges and advantages. On this basis, the aim of our study was to conduct a comparative analysis between these two methods while also investigating the merits of integrating multiple database-based methods. Through an in-depth comparative study, we evaluated the performance of both methodological categories in taxonomic classification by utilizing simulated data sets. Our analysis revealed that database-based methods excel in classification accuracy when backed by a rich and comprehensive reference database. Machine learning methods, by contrast, perform better when reference sequences are sparse or lacking, but they generally underperform database-based methods under most other conditions. Moreover, our study confirms that integrating multiple database-based methods enhances classification accuracy. These findings shed new light on the taxonomic classification of high-throughput sequencing data and bear substantial implications for the future development of computational biology. For those interested in further exploring our methods, the source code of this study is publicly available at https://github.com/LoadStar822/Genome-Classifier-Performance-Evaluator. Additionally, a dedicated webpage showcasing our collected database, data sets, and various classification software can be found at http://lab.malab.cn/~tqz/project/taxonomic/.

Funder

National Natural Science Foundation of China

Publisher

Oxford University Press (OUP)

Link

https://academic.oup.com/gbe/advance-article-pdf/doi/10.1093/gbe/evae102/57677906/evae102.pdf

Reference: 68 articles.

1. k-SLAM: accurate and ultra-fast taxonomic classification and gene identification for large metagenomic data sets;Ainsworth;Nucleic Acids Res,2017

2. Short k-mer abundance profiles yield robust machine learning features and accurate classifiers for RNA viruses;Alam;PLoS One,2020

3. Scalable metagenomic taxonomy classification using a reference genome database;Ames;Bioinformatics,2013

4. Species determination using AI machine-learning algorithms: Hebeloma as a case study;Bartlett;IMA Fungus,2022

5. Extending and improving metagenomic taxonomic profiling with uncharacterized species using MetaPhlAn 4;Blanco-Míguez;Nat Biotechnol,2023