Abstract
The sheer diversity of unculturable viruses has prompted the need to describe new viruses through culture-independent techniques. The associated host is one important phenotypic feature that can be inferred from metagenomic viral contigs, thanks to the development of various bioinformatic tools. Here we compare the performance of recently developed tools for virus-host prediction on a dataset of 1,046 virus-host pairs and then apply the best-performing tools to a metagenomic dataset derived from a highly diverse, transiently hypersaline site known as Archaean Domes within the Cuatro Ciénegas Basin, Coahuila, Mexico. We also introduce a virus-host prediction tool called CrisprCustomDB, which uses specific criteria to resolve controversial host assignments with custom spacer databases. Host-dependent alignment-based methods showed an average precision of 83% and a sensitivity ranging from 13.7% to 17.7%, whereas host-dependent alignment-free methods achieved an average precision of 75.7% and a sensitivity of 57.5%. RaFAH, a virus-dependent alignment-based tool, had the best overall performance (F1 score = 95.7%). However, when applied to the highly diverse metagenomic dataset, the host-dependent alignment-based (e.g., CrisprCustomDB) and alignment-free (e.g., PHP) methods showed the greatest agreement with each other, even though they are fundamentally different methods. This is because, instead of depending on databases of known hosts or of viruses with known hosts, they can directly relate metagenomic viral contigs and metagenome-assembled genomes from the same dataset. Such methods also showed the greatest consistency between the source environment and the predicted host taxonomy, habitat, lifestyle, or metabolism, revealing that Archaean Domes viruses likely infect halophilic Archaea as well as a variety of Bacteria that may be halophilic, halotolerant, alkaliphilic, thermophilic, oligotrophic, sulfate-reducing, or marine-related.
Consequently, combining several methods with qualitative validations that relate the source environment to the predicted host biology will increase the number of correct predictions, especially when dealing with novel viruses.
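The precision, sensitivity, and F1 score reported above can be illustrated with a minimal sketch. It assumes the standard definitions (precision = TP / (TP + FP), sensitivity = TP / (TP + FN), F1 = their harmonic mean), which may differ in detail from how the benchmark counted correct, incorrect, and missing host predictions. The function name and the counts are hypothetical and are not taken from the study.

```python
from typing import Tuple


def precision_sensitivity_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, sensitivity, F1) from prediction counts.

    Assumed meaning of the counts (not confirmed by the source):
      tp: virus-host pairs whose predicted host matches the known host
      fp: pairs with a predicted host that does not match the known host
      fn: pairs for which the correct host was not predicted
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + sensitivity
    # F1 is the harmonic mean of precision and sensitivity.
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return precision, sensitivity, f1


# Hypothetical counts, chosen only to show a precise but insensitive tool.
p, s, f1 = precision_sensitivity_f1(tp=80, fp=20, fn=120)
print(f"precision={p:.3f} sensitivity={s:.3f} F1={f1:.3f}")
```

With these made-up counts, a tool that rarely predicts a wrong host but leaves many viruses unassigned has high precision and low sensitivity, which is the pattern described above for the host-dependent alignment-based methods.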
Publisher
Cold Spring Harbor Laboratory