Identification of bacteriophage genome sequences with representation learning

Published:2022-08-03 Issue:18 Volume:38 Page:4264-4270
ISSN:1367-4803
Container-title:Bioinformatics
language:en

Author:

Bai Zeheng¹, Zhang Yao-zhong¹, Miyano Satoru¹², Yamaguchi Rui¹³⁴, Fujimoto Kosuke⁵⁶, Uematsu Satoshi⁵⁶, Imoto Seiya¹⁶

Affiliation:

1. Division of Health Medical Intelligence, Human Genome Center, The Institute of Medical Science, The University of Tokyo, Minato-ku, Tokyo 108-8639, Japan

2. M&D Data Science Center, Tokyo Medical and Dental University, Tokyo 113-8510, Japan

3. Division of Cancer Systems Biology, Aichi Cancer Center Research Institute, Nagoya 464-8681, Japan

4. Division of Cancer Informatics, Nagoya University Graduate School of Medicine, Nagoya 466-8560, Japan

5. Division of Metagenome Medicine, Human Genome Center, The Institute of Medical Science, The University of Tokyo, Minato-ku, Tokyo 108-8639, Japan

6. Collaborative Research Institute for Innovative Microbiology, The University of Tokyo, Bunkyo-ku, Tokyo 113-8657, Japan

Abstract

Motivation: Bacteriophages (phages) are viruses that infect and replicate within bacteria and archaea, and they are abundant in the human body. To investigate the relationship between phages and microbial communities, identifying phages from metagenome sequences is the first step. Currently, there are two main methods for identifying phages: database-based (alignment-based) methods and alignment-free methods. Database-based methods typically use a large number of sequences as references; alignment-free methods usually learn the features of the sequences with machine learning and deep learning models.

Results: We propose INHERIT, which uses a deep representation learning model to integrate both database-based and alignment-free methods, combining the strengths of both. Pre-training is used as an alternative way of acquiring knowledge representations from existing databases, while the BERT-style deep learning framework retains the advantage of alignment-free methods. We compare INHERIT with four existing methods on a third-party benchmark dataset. Our experiments show that INHERIT achieves better performance, with an F1-score of 0.9932. In addition, we find that pre-training two species separately helps the non-alignment deep learning model make more accurate predictions.

Availability and implementation: The code of INHERIT is available at https://github.com/Celestial-Bai/INHERIT.

Supplementary information: Supplementary data are available at Bioinformatics online.
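Note on the metric: the F1-score quoted in the abstract is the harmonic mean of precision and recall. The following is a minimal sketch of that calculation, not code from INHERIT; the function name `f1_score` and the example counts are hypothetical and chosen only for illustration.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return the F1-score from true-positive, false-positive and false-negative counts."""
    if tp == 0:
        # With no true positives, precision or recall is zero (or undefined),
        # so the F1-score is conventionally reported as 0.
        return 0.0
    precision = tp / (tp + fp)  # fraction of predicted phages that are real phages
    recall = tp / (tp + fn)     # fraction of real phages that were predicted
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, not taken from the paper's benchmark.
example = f1_score(tp=90, fp=10, fn=10)
```

Because the harmonic mean is dominated by the smaller of the two values, a classifier scores well on F1 only when both precision and recall are high.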

Funder

Ministry of Education, Culture, Sports, Science, and Technology of Japan

Japan Society for the Promotion of Science

JSPS KAKENHI

Japan Agency for Medical Research and Development

Uehara Memorial Foundation

Publisher

Oxford University Press (OUP)

Subject

Computational Mathematics, Computational Theory and Mathematics, Computer Science Applications, Molecular Biology, Biochemistry, Statistics and Probability

Link

https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btac509/45277953/btac509.pdf

Reference: 48 articles.

1. Antibiotic resistance and its cost: is it possible to reverse resistance?; Andersson; Nat. Rev. Microbiol, 2010

2. Seeker: alignment-free identification of bacteriophage genomes by deep learning; Auslander; Nucleic Acids Res, 2020

3. Representation learning: a review and new perspectives; Bengio; IEEE Trans. Pattern Anal. Mach. Intell, 2013

4. Phages and their application against drug-resistant bacteria; Chanishvili; J. Chem. Technol. Biotechnol, 2001

5. Multiple sequence alignment modeling: methods and applications; Chatzou; Brief. Bioinform, 2016

Cited by 15 articles.

1. BFVD - a large repository of predicted viral protein structures; 2024-09-09

2. A foundational large language model for edible plant genomes; Communications Biology; 2024-07-09

3. VirRep: a hybrid language representation learning framework for identifying viruses from human gut metagenomes; Genome Biology; 2024-07-04

4. PharaCon: A new framework for identifying bacteriophages via conditional representation learning; 2024-06-17

5. Predicting viral host codon fitness and path shifting through tree-based learning on codon usage biases and genomic characteristics; 2024-05-23