A 2D convolutional neural network for taxonomic classification applied to viruses in the phylum Cressdnaviricota

Author:

Gomes Ruither A. L., Zerbini F. Murilo

Abstract

Taxonomy, defined as the classification of objects or organisms into stable hierarchical categories (taxa), is fundamental for proper scientific communication. In virology, taxonomic assignments based on sequence alone are now possible, and their use may contribute to a more precise and comprehensive framework. The current major challenge is to develop tools for the automated classification of the millions of putative new viruses discovered in metagenomic studies. Among the many tools that have been proposed, those applying machine learning (ML), mainly deep learning, stand out for their highly accurate results. VirusTaxo, a recently released ML tool based on k-mers, was the first to be applied successfully to all types of viruses, with 93% average accuracy. Nevertheless, there is a demand for new tools that are less computationally intensive. Viruses classified in the phylum Cressdnaviricota, with their small and compact genomes, are good subjects for testing these new tools. Here we tested 2D convolutional neural networks for the taxonomic classification of cressdnaviricots. We also tested the effect of data imbalance and of two augmentation techniques, benchmarking against VirusTaxo. We obtained perfect classification in k-fold test evaluations for balanced taxa, and more than 98% accuracy in the final pipeline tested on imbalanced datasets. Combining augmentation for the more imbalanced groups with no augmentation for the more balanced ones achieved the best score in the final test. These results indicate that these architectures can classify DNA sequences with high precision.
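
The abstract does not say how a DNA sequence is turned into the 2D input that such a network consumes. As a minimal sketch of one possible encoding, assuming a k-mer frequency grid (the function name `kmer_grid`, the choice of `k`, and the row/column layout are all hypothetical and not taken from the paper), the idea could look like this:

```python
from itertools import product

BASES = "ACGT"


def kmer_grid(seq, k=4):
    """Count k-mers in seq and lay the relative frequencies out as a square grid.

    Row index comes from the first k/2 bases of each k-mer, column index
    from the last k/2 bases. This layout is an illustrative assumption,
    not the encoding used by the authors.
    """
    if k < 2 or k % 2 != 0:
        raise ValueError("k must be a positive even number for a square grid")
    half = k // 2
    # Map every half-k-mer (e.g. "AC") to an integer position 0 .. 4**half - 1.
    index = {"".join(p): i for i, p in enumerate(product(BASES, repeat=half))}
    side = len(index)
    grid = [[0.0] * side for _ in range(side)]

    seq = seq.upper()
    total = 0
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if any(base not in BASES for base in kmer):
            continue  # skip k-mers containing ambiguous bases such as N
        grid[index[kmer[:half]]][index[kmer[half:]]] += 1.0
        total += 1

    # Normalise counts to frequencies so genomes of different length are comparable.
    if total:
        grid = [[count / total for count in row] for row in grid]
    return grid


if __name__ == "__main__":
    for row in kmer_grid("ACGTACGT", k=2):
        print(row)
```

With `k=4` this yields a 16 x 16 grid per genome, which could then be passed to a 2D convolutional layer as a single-channel image. Other encodings (one-hot matrices, chaos game representation) would serve the same purpose.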

Publisher

Cold Spring Harbor Laboratory
