Application and Comparison of Machine Learning and Database-Based Methods in Taxonomic Classification of High-Throughput Sequencing Data

Author:

Tian Qinzhong (1, 2), Zhang Pinglu (1, 2), Zhai Yixiao (1, 2), Wang Yansu (1, 2), Zou Quan (1, 2)

Affiliation:

1. Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China

2. Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003, China

Abstract

The advent of high-throughput sequencing technologies has revolutionized bioinformatics and heightened the demand for efficient taxonomic classification. Despite technological advances, processing and analyzing the deluge of sequencing data for precise taxonomic classification remains a formidable challenge. Existing classification approaches fall primarily into two categories, database-based methods and machine learning methods, each with its own advantages and challenges. The aim of our study was to compare these two categories and to investigate the merits of integrating multiple database-based methods. We evaluated the performance of both methodological categories in taxonomic classification using simulated data sets. Our analysis revealed that database-based methods excel in classification accuracy when backed by a rich and comprehensive reference database. Machine learning methods, by contrast, perform better where reference sequences are sparse or lacking, but are generally inferior to database-based methods under most conditions. Our study also confirms that integrating multiple database-based methods enhances classification accuracy. These findings shed new light on the taxonomic classification of high-throughput sequencing data and bear substantial implications for the future development of computational biology. The source code of this study is publicly available at https://github.com/LoadStar822/Genome-Classifier-Performance-Evaluator. A dedicated webpage showcasing our collected database, data sets, and various classification software can be found at http://lab.malab.cn/~tqz/project/taxonomic/.

Funder

National Natural Science Foundation of China

Publisher

Oxford University Press (OUP)

