Author:
Rubén Pérez-Bucio, Enault François, Galiez Clovis
Abstract
Over the last twenty years, hundreds of metagenomic studies have generated millions of viral genomic sequences from a wide variety of ecosystems. Despite this, the overall genetic diversity of viruses remains elusive, both in terms of the number of protein families they encode and the diversity of these families. Indeed, although it is recognized that organizing the viral protein sequence space requires sensitive homology detection methods, such methods have never been applied at a large scale. To produce a more realistic and comprehensive view of protein diversity in the viral world, we have (i) collected thousands of viromes and identified viral contigs and proteins within them, (ii) retrieved viral proteins available in different public databases, and (iii) applied sensitive similarity searches to cluster all these proteins into families. More than 46 million deduplicated proteins were clustered into fewer than 2.3 million protein families. An iterative procedure to detect and remove genomic sequences of cellular origin, specially developed here, showed that only a very small fraction of sequences were likely to be cellular contamination (∼2 % of contigs, 7 K clusters). The remaining 2,203,457 clusters were coined enVhogs (for environmental Viral homologous groups). Their multiple sequence alignments have been transformed into HMMs to constitute the EnVhog database. Although only a small proportion of enVhogs were annotated (15.9 %), the annotated enVhogs encompass almost half of the protein dataset (44.8 %). Applied to the annotation of four recently published viromes from diverse environments (sulfuric soil, grassland, surface seawater and human gut), enVhog HMMs doubled the number of viral sequences characterized and increased the number of functionally annotated proteins by 54 % to 74 %.
EnVhog, the largest comprehensive compilation of viral protein information to date, will thus further help to determine the functions of proteins encoded in newly sequenced viral genomes, and help to improve the accuracy of viral sequence detection tools. The EnVhog database is available at http://envhog.u-ga.fr/envhog.
Publisher
Cold Spring Harbor Laboratory