(Machine) Learning the mutation signatures of SARS-CoV-2: a primer for predictive prognosis

Author:

Nagpal Sunil, Pinna Nishal Kumar, Srivastava Divyanshu, Singh Rohan, Mande Sharmila S.

Abstract

Motivation: Continuous emergence of new variants through the appearance, accumulation and disappearance of mutations is a hallmark of many viral diseases. SARS-CoV-2 and its variants have exerted tremendous pressure on global healthcare systems owing to their life-threatening and debilitating implications. The sheer plurality of the variants and the huge scale of genome sequence data available for COVID-19 have added to the challenge of tracing mutations of concern. The scale of the data, however, also provides an opportunity to treat SARS-CoV-2 genomes and the mutations therein as ‘big data records’ and to comprehensively classify the variants through the (machine) learning of mutation patterns. The unprecedented sequencing effort and tracing of disease outcomes provide an excellent ground for identifying important mutations by developing machine-learned models, or severity classifiers, from the mutation profile of SARS-CoV-2. This is expected to provide a significant impetus to efforts towards not only identifying the mutations of concern but also exploring the potential of mutation-driven predictive prognosis of SARS-CoV-2.

Results: We describe how a graduated approach of building various severity-specific machine learning classifiers, using only the mutation corpus of SARS-CoV-2 genomes, can potentially lead to the identification of important mutations and guide potential prognosis of infection. We demonstrate the applicability of model-derived important mutations and the use of Shapley values to identify the significant mutations of concern as well as to develop sparse models of outcome classification. A total of 77,284 outcome-traced SARS-CoV-2 genomes were employed in this study, representing a total corpus of 30,346 unique nucleotide mutations and 18,647 amino acid mutations. Machine learning models pertaining to graduated classifiers of the target outcomes, namely ‘Asymptomatic, Mild, Symptomatic/Moderate, Severe and Fatal’, were built considering the TRIPOD guidelines for predictive prognosis. Shapley values for model-linked important mutations were employed to select significant mutations, leading to the identification of fewer than 20 outcome-driving mutations from each classifier. We additionally describe the significance of adopting a ‘temporal modeling approach’ to benchmark the predictive prognosis linked with continuously evolving pathogens. Chronologically distinct sampling is important in evaluating how accurately models trained on ‘past data’ classify the prognosis linked with future genomes (those observed with new mutations). We conclude that while a machine learning approach can play a vital role in identifying relevant mutations, caution should be exercised in using mutation signatures for predictive prognosis in cases where new mutations have accumulated along with the previously observed mutations of concern.

Contact: sharmila.mande@tcs.com

Supplementary information: Supplementary data are enclosed.
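The chronologically distinct sampling described in the Results can be illustrated with a minimal sketch. It is not the authors' pipeline: the record layout, the placeholder mutation names (`m1` to `m5`), the outcome labels attached to them, and the helper names `temporal_split` and `unseen_mutations` are all assumptions made for illustration. The sketch splits genomes at a cutoff date and reports which mutations appear only in the 'future' set, which is the situation the abstract cautions about.

```python
from datetime import date

# Hypothetical toy records: (collection date, mutation set, outcome label).
# Mutation names and outcomes are placeholders, not data from the study.
records = [
    (date(2020, 4, 1), {"m1"}, "Mild"),
    (date(2020, 9, 15), {"m1", "m2"}, "Severe"),
    (date(2021, 1, 10), {"m1", "m2", "m3"}, "Severe"),
    (date(2021, 6, 5), {"m1", "m4", "m5"}, "Fatal"),
]


def temporal_split(records, cutoff):
    """Split records into 'past' (before cutoff) and 'future' (on or after it)."""
    past = [r for r in records if r[0] < cutoff]
    future = [r for r in records if r[0] >= cutoff]
    return past, future


def unseen_mutations(past, future):
    """Return mutations present in future genomes but absent from all past ones."""
    seen = set().union(*(muts for _, muts, _ in past))
    later = set().union(*(muts for _, muts, _ in future))
    return later - seen


past, future = temporal_split(records, date(2021, 1, 1))
novel = unseen_mutations(past, future)
print(len(past), len(future), sorted(novel))
```

A classifier would be trained on `past` and evaluated on `future`; a large `novel` set suggests the held-out genomes carry mutations the model has never seen, so its predictions for them deserve extra caution.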

Publisher

Cold Spring Harbor Laboratory

Cited by 3 articles.