Highly accurate classification and discovery of microbial protein-coding gene functions using FunGeneTyper: an extensible deep learning framework

Author:

Zhang Guoqing (1,2,3,4), Wang Hui (5,3), Zhang Zhiguo (2,3), Zhang Lu (2,3), Guo Guibing (6), Yang Jian (7,8), Yuan Fajie (5,3), Ju Feng (2,3,4,7,8)

Affiliation:

1. College of Environmental and Resource Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, China

2. Key Laboratory of Coastal Environment and Resources of Zhejiang Province, School of Engineering, Hangzhou, Zhejiang 310030, China

3. Westlake University, School of Engineering, Hangzhou, Zhejiang 310030, China

4. Center of Synthetic Biology and Integrated Bioengineering, Westlake University, Hangzhou, Zhejiang 310030, China

5. Representation Learning Laboratory, School of Engineering, Hangzhou, Zhejiang 310030, China

6. Software College, Northeastern University, Shenyang, Liaoning 110169, China

7. Westlake Laboratory of Life Sciences and Biomedicine, School of Life Sciences, Hangzhou, Zhejiang 310024, China

8. Westlake University, School of Life Sciences, Hangzhou, Zhejiang 310024, China

Abstract

High-throughput DNA sequencing technologies decode tremendous amounts of microbial protein-coding gene sequences. However, accurately assigning protein functions to novel gene sequences remains a challenge. To this end, we developed FunGeneTyper, an extensible framework with two new deep learning models (i.e., FunTrans and FunRep), structured databases, and supporting resources for achieving highly accurate (Accuracy > 0.99, F1-score > 0.97) and fine-grained classification of antibiotic resistance genes (ARGs) and virulence factor genes. Using an experimentally confirmed dataset of ARGs comprising remote homologous sequences as the test set, our framework achieves by far the best performance in the discovery of new ARGs from human gut (F1-score: 0.6948), wastewater (0.6072), and soil (0.5445) microbiomes, outperforming state-of-the-art bioinformatics tools and both sequence alignment-based (F1-score: 0.0556–0.5065) and domain-based (F1-score: 0.2630–0.5224) annotation approaches. Furthermore, our framework is implemented as a lightweight, privacy-preserving, and plug-and-play neural network module, which makes it versatile and accessible to developers and users worldwide. We anticipate widespread use of FunGeneTyper (https://github.com/emblab-westlake/FunGeneTyper) for precise classification of protein-coding gene functions and the discovery of numerous valuable enzymes. This advancement will have a significant impact on various fields, including microbiome research, biotechnology, metagenomics, and bioinformatics.
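
For readers unfamiliar with the metrics quoted in the abstract, the following is a minimal sketch of how accuracy and F1-score are conventionally computed for a two-class problem (for example, ARG versus non-ARG). The function name `binary_metrics` and the toy labels are illustrative assumptions and are not part of FunGeneTyper; the abstract does not state which averaging scheme was used for the multi-class results, so this shows only the basic binary definitions.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for one positive class.

    Hypothetical helper for illustration only; not taken from FunGeneTyper.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy example: 3 true ARGs among 8 sequences, with one miss and one false hit.
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0, 0, 0]
toy_metrics = binary_metrics(toy_true, toy_pred)
```

Because F1 ignores true negatives, it can be much lower than accuracy when the positive class is rare, which is one reason an abstract may report both.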

Funder

Center of Synthetic Biology and Integrated Bioengineering

Research Center for Industries of the Future

‘Pioneer’ and ‘Leading Goose’ Key R&D Program of Zhejiang

Zhejiang Provincial Natural Science Foundation of China

Publisher

Oxford University Press (OUP)
