Abstract
Genomes have an inherent context dictated by the order in which nucleotides and higher-order genomic elements are arranged in DNA/RNA. Learning this context is a daunting task, governed by the combinatorial complexity of the interactions possible between ordered elements of genomes. Can natural language processing (NLP) be applied to these ordered, complex, and evolving data types (genomic sequences) to reveal the latent patterns or context of genomic elements (e.g., mutations)? Here we present an approach to understanding the mutational landscape of COVID-19 by treating the temporally changing (continuously mutating) SARS-CoV-2 genomes as documents. We demonstrate how interpreting evolving genomes as analogous to temporal literature corpora provides an opportunity to use dynamic topic modeling (DTM) and temporal Word2Vec models to delineate mutation signatures corresponding to different Variants of Concern and to track the semantic drift of Mutations of Concern (MoCs). We identified and studied characteristic mutations associated with COVID-19 infection severity and tracked their relationship with MoCs. Our groundwork on the utility of such temporal NLP models in genomics could not only supplement ongoing efforts to understand the COVID-19 pandemic but also provide alternative strategies for studying dynamic phenomena in the biological sciences through data science (especially NLP and AI/ML).
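To make the genomes-as-documents analogy concrete, the following is a minimal sketch, not the authors' pipeline. It treats each genome as a "document" whose "words" are its mutations, groups documents by time slice, and compares which mutations co-occur with a given mutation in each slice. Plain co-occurrence counting stands in here for the DTM and temporal Word2Vec models named above. The records, time slices, and function names are invented for illustration and are not real data.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Invented toy records (NOT real data): (time slice, mutations in one genome).
genomes = [
    ("2020-03", ["D614G", "P323L"]),
    ("2020-03", ["D614G", "P323L", "R203K"]),
    ("2021-01", ["D614G", "N501Y", "P681H"]),
    ("2021-01", ["D614G", "N501Y", "E484K"]),
]

def cooccurrence_by_slice(records):
    """Count, per time slice, how often each pair of mutations shares a genome."""
    counts = defaultdict(Counter)
    for time_slice, mutations in records:
        # Each genome is a document; each unordered mutation pair is counted once.
        for a, b in combinations(sorted(set(mutations)), 2):
            counts[time_slice][(a, b)] += 1
    return counts

def neighbours(counts, time_slice, mutation):
    """Return the mutations that co-occur with `mutation` in one slice, with counts."""
    result = Counter()
    for (a, b), n in counts[time_slice].items():
        if a == mutation:
            result[b] += n
        elif b == mutation:
            result[a] += n
    return result

counts = cooccurrence_by_slice(genomes)
early = neighbours(counts, "2020-03", "D614G")
late = neighbours(counts, "2021-01", "D614G")

# A change in a mutation's co-occurring neighbours between slices is a crude
# proxy for the "semantic drift" that temporal embeddings would measure.
print("early context:", dict(early))
print("late context:", dict(late))
```

In this toy setup the context of D614G shifts from one set of companion mutations to another between the two slices. A temporal embedding model would capture a comparable shift as movement of the mutation's vector over time.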
Publisher
Cold Spring Harbor Laboratory
Cited by
1 article.