Abstract
Endogenous retroviruses (ERVs) are remnants of ancient retroviral infections of mammalian germline cells. A large proportion of ERVs have lost their open reading frames (ORFs), while others retain them and have been exapted by the host species. However, it remains unclear what proportion of ERVs possess ORFs (ERV-ORFs), are transcribed, and serve as candidates for co-opted genes. We therefore investigated the characteristics of 176,401 ERV-ORFs containing retroviral-like protein domains (gag, pro, pol, and env) in 19 mammalian genomes. The fraction of ERVs possessing ORFs was small overall (∼0.15%), although it varied with domain type and species. The observed divergence of ERV-ORFs from their consensus sequences suggested that a large proportion of ERV-ORFs were inserted into mammalian genomes either recently or anciently. In contrast, very few ERVs lacking ORFs exhibited similar divergence patterns. To identify transcribed ERV-ORFs with the potential to encode proteins, we compared ERV-ORFs with multi-omics data, including transcriptome data, trimethylation at histone H3 lysine 36 (H3K36me3), and transcription initiation sites from 2,834 cell types. We found 408 and 752 ERV-ORFs with high transcriptional potential in humans and mice, respectively, accounting for 2–3% of all ERV-ORFs. Moreover, many of these ERV-ORFs with transcriptional potential were lineage-specific sequences exhibiting tissue-specific expression. These results suggest that uncharacterized functional genes containing ERV-ORFs may be expressed yet remain hidden within mammalian genomes. Together, our analyses suggest that more ERV-ORFs than currently known may have been co-opted in a host species-specific manner, and that these are likely to have contributed to mammalian evolution and diversification.
Publisher
Cold Spring Harbor Laboratory