Benchmarking informatics approaches for virus discovery: caution is needed when combining in silico identification methods

Published:2024-03-19 Issue:3 Volume:9
ISSN:2379-5077
Container-title:mSystems
language:en
Short-container-title:mSystems

Author:

Hegarty Bridget¹, Riddell V James², Bastien Eric³, Langenfeld Kathryn⁴, Lindback Morgan³, Saini Jaspreet S.⁵, Wing Anthony³, Zhang Jessica⁶, Duhaime Melissa³

Affiliation:

1. Department of Civil and Environmental Engineering, Case Western Reserve University, Cleveland, Ohio, USA

2. Department of Microbiology, The Ohio State University, Columbus, Ohio, USA

3. Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan, USA

4. Department of Civil and Environmental Engineering, Stanford University, Palo Alto, California, USA

5. Laboratory for Environmental Biotechnology, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland

6. Department of Civil and Environmental Engineering, University of Michigan, Ann Arbor, Michigan, USA

Abstract

ABSTRACT Understanding the ecological impacts of viruses on natural and engineered ecosystems relies on the accurate identification of viral sequences from community sequencing data. To maximize viral recovery from metagenomes, researchers frequently combine viral identification tools. However, the effectiveness of this strategy is unknown. Here, we benchmarked combinations of six widely used informatics tools for viral identification and analysis (VirSorter, VirSorter2, VIBRANT, DeepVirFinder, CheckV, and Kaiju), called “rulesets.” Rulesets were tested against mock metagenomes composed of taxonomically diverse sequence types and diverse aquatic metagenomes to assess the effects of the degree of viral enrichment and habitat on tool performance. We found that six rulesets achieved equivalent accuracy [Matthews Correlation Coefficient (MCC) = 0.77, adjusted P ≥ 0.05]. Each contained VirSorter2, and five used our “tuning removal” rule designed to remove non-viral contamination. While DeepVirFinder, VIBRANT, and VirSorter were each found once in these high-accuracy rulesets, they were not found in combination with each other: combining tools does not lead to optimal performance. Our validation suggests that the MCC plateau at 0.77 is partly caused by inaccurate labeling within reference sequence databases. In aquatic metagenomes, our highest MCC ruleset identified more viral sequences in virus-enriched (44%–46%) than in cellular metagenomes (7%–19%). While improved algorithms may lead to more accurate viral identification tools, this should be done in tandem with careful curation of sequence databases. We recommend using the VirSorter2 ruleset and our empirically derived tuning removal rule. Our analysis provides insight into methods for in silico viral identification and will enable more robust viral identification from metagenomic data sets.
IMPORTANCE The identification of viruses from environmental metagenomes using informatics tools has offered critical insights in microbial ecology. However, it remains difficult for researchers to know which tools optimize viral recovery for their specific study. In an attempt to recover more viruses, studies are increasingly combining the outputs from multiple tools without validating this approach. After benchmarking combinations of six viral identification tools against mock metagenomes and environmental samples, we found that these tools should only be combined cautiously. Two- to four-tool combinations maximized viral recovery and minimized non-viral contamination compared with either the single-tool or the five- to six-tool ones. By providing a rigorous overview of the behavior of in silico viral identification strategies and a pipeline to replicate our process, our findings guide the use of existing viral identification tools and offer a blueprint for feature engineering of new tools that will lead to higher-confidence viral discovery in microbiome studies.
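
The accuracy metric named in the abstract, the Matthews Correlation Coefficient (MCC), is computed from the confusion matrix of viral versus non-viral calls. The Python sketch below is a minimal illustration only: the function names `confusion_counts` and `mcc` and the example labels are hypothetical and do not come from the paper or its pipeline, and returning 0.0 when the denominator is zero is an assumed convention.

```python
import math


def confusion_counts(truth, predicted):
    """Count TP, TN, FP, FN for paired boolean labels (True = viral)."""
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    tn = sum(1 for t, p in zip(truth, predicted) if not t and not p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts.

    Ranges from -1 (total disagreement) to +1 (perfect agreement).
    Assumption: return 0.0 when any marginal total is zero, since the
    coefficient is undefined in that case.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical example: six sequences with known labels and one ruleset's calls.
truth = [True, True, True, False, False, False]
calls = [True, True, False, False, False, True]
score = mcc(*confusion_counts(truth, calls))
print(f"MCC = {score:.2f}")
```

Unlike raw accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when viral sequences are a small minority of a metagenome.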

Funder

National Science Foundation

DOC | National Oceanic and Atmospheric Administration

College of Engineering, University of Michigan

Publisher

American Society for Microbiology

Link

https://journals.asm.org/doi/pdf/10.1128/msystems.01105-23

Cited by 3 articles.

1. Optimization of screening methods leads to the discovery of new viruses in black soldier flies (Hermetia illucens);2024-08-27

2. VirID: Beyond Virus Discovery - An Integrated Platform for Comprehensive RNA Virus Characterization;2024-07-09

3. A panoramic view of the virosphere in three wastewater treatment plants by integrating viral‐like particle‐concentrated and traditional non‐concentrated metagenomic approaches;iMeta;2024-03-29