Abstract
We developed a k-mer-based pipeline, namely the Pathogen Origin Recognition Tool using Enriched K-mers (PORT-EK) to identify genomic regions enriched in the respective hosts after the comparison of metagenomes of isolates between two host species. Using it we identified thousands of k-mers enriched in US white-tailed deer and betacoronaviruses in bat reservoirs while comparing them with human isolates. We demonstrated different coverage landscapes of k-mers enriched in deer and bats and unraveled 148 mutations in enriched k-mers yielded from the comparison of viral metagenomes between bat and human isolates. We observed that the third position within a genetic codon is prone to mutations, resulting in a high frequency of synonymous mutations of amino acids harboring the same physicochemical properties as unaltered amino acids. Finally, we classified and predicted the likelihood of host species based on the enriched k-mer counts. Altogether, PORT-EK showcased its feasibility for identifying enriched viral genomic regions, illuminating the different intrinsic tropisms of coronavirus after host domestication.