Abstract
Despite the growing interest in the role of the gut virome in human health and disease, identifying viral sequences from human gut metagenomes remains computationally challenging due to underrepresentation of viral genomes in reference databases. Several recent large-scale efforts have mined human gut metagenomes to establish viral sequence catalogues, using varied computational tools and quality control criteria. However, there has been no consistent comparison of these catalogues’ quality, diversity, and completeness, nor unification into a comprehensive resource. Here, we systematically surveyed nine previously published human gut viral catalogues, assessing their quality and the overlap of the viral sequences retrieved. While these catalogues collectively screened >40,000 human fecal metagenomes, 82% of the recovered 345,613 viral sequences were unique to one catalogue, highlighting limited redundancy. We further expanded representation by mining 7,867 infant gut metagenomes, retrieving 1,205,739 additional putative viral sequences. From these datasets, we constructed the Aggregated Gut Viral Catalogue (AVrC), a unified modular resource containing 1,018,941 dereplicated viral sequences (449,859 species-level vOTUs). Detailed annotations were generated for sequence quality, taxonomy, predicted lifestyle, and putative host. The AVrC reveals the gut virome’s substantial unexplored diversity, providing a pivotal resource for viral discovery. The AVrC is accessible as a relational database and through a web interface allowing customized querying and subset retrieval, enabling streamlined utilization by the research community and future expansions as novel data becomes available.
Author summary
The human gut is home to a vast array of viruses, collectively known as the gut virome, which play a crucial role in human health and disease.
Recently, several research groups aiming to provide an overview of human gut viral diversity have created catalogues of viral sequences found in the human gut by analyzing large numbers of fecal samples from different individuals. In this study, we compared nine of these existing catalogues and found that there was surprisingly little overlap between them, with 82% of the viral sequences being unique to a single catalogue. To further expand the available data, we analyzed nearly 8,000 additional fecal samples from infants. By combining all these datasets, we created a unified resource called the Aggregated Gut Viral Catalogue (AVrC), which contains more than a million distinct viral sequences, representing nearly 450,000 different viral species. This catalogue, which is easily accessible to the scientific community through a user-friendly web interface, provides a valuable tool for exploring the vast diversity of the human gut virome and its potential implications for human health.
Publisher
Cold Spring Harbor Laboratory