Abstract
The continuing emergence of SARS-CoV-2 variants of concern (VOCs) presents a serious public health threat, exacerbating the effects of the COVID-19 pandemic. Although millions of genomes have been deposited in public archives since the start of the pandemic, predicting SARS-CoV-2 clinical characteristics from the genome sequence remains challenging. In this study, we used a collection of over 29,000 high-quality SARS-CoV-2 genomes to build machine learning models for predicting clinical detection cycle threshold (Ct) values, which correspond with viral load. After evaluating several machine learning methods and parameters, our best model was a random forest regressor that used 10-mer oligonucleotides as features and achieved an R² score of 0.521 ± 0.010 (95% confidence interval over 5 folds) and an RMSE of 5.7 ± 0.034, demonstrating that the models can detect a signal in the genomic data. To test whether Ct values can be predicted for newly emerging variants, we predicted Ct values for Omicron variants using models trained on previous variants. We found that approximately 5% of the training data needed to come from the new variant for the model to learn its Ct values. Finally, to understand how the model works, we evaluated the top features and found that the model uses a multitude of k-mers from across the genome to make its predictions. However, when we looked at the top k-mers that occurred most frequently across the set of genomes, we observed a clustering of k-mers spanning spike protein regions that correspond with key variations that are hallmarks of the VOCs, including G339, K417, L452, N501, and P681, indicating that these sites are informative in the model and may impact the Ct values observed in clinical samples.
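As a rough illustration of the feature representation and evaluation metrics named above, the following sketch counts overlapping k-mers in each genome, assembles a genome-by-k-mer count matrix, and computes R² and RMSE from predicted Ct values. It is not the authors' pipeline: the function names (`kmer_counts`, `feature_matrix`, `r2_score`, `rmse`) are hypothetical, and skipping windows that contain ambiguous bases is an assumption made here. In practice the matrix would be passed to a regressor such as scikit-learn's `RandomForestRegressor`, which is omitted to keep the example dependency-free.

```python
from collections import Counter
from math import sqrt


def kmer_counts(sequence, k=10):
    """Count overlapping k-mers, skipping windows with non-ACGT bases (assumption)."""
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def feature_matrix(sequences, k=10):
    """Return (vocabulary, rows): one row per genome, one column per observed k-mer."""
    per_genome = [kmer_counts(s, k) for s in sequences]
    vocabulary = sorted(set().union(*per_genome))
    rows = [[counts[kmer] for kmer in vocabulary] for counts in per_genome]
    return vocabulary, rows


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted Ct values."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return sqrt(squared / len(y_true))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

With k = 10 there are up to 4^10 (about one million) possible k-mers, so a real implementation would likely use a sparse matrix rather than the dense lists shown here.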
Publisher
Cold Spring Harbor Laboratory