Semi-Supervised Pipeline for Autonomous Annotation of SARS-CoV-2 Genomes

Published: 2021-12-03 Issue: 12 Volume: 13 Page: 2426
ISSN: 1999-4915
Container-title: Viruses
Language: en
Short-container-title: Viruses

Author:

Beck Kristen L., Seabolt Edward, Agarwal Akshay, Nayar Gowri, Bianco Simone, Krishnareddy Harsha, Ngo Timothy A., Kunitomi Mark, Mukherjee Vandana, Kaufman James H.

Abstract

SARS-CoV-2 genomic sequencing efforts have scaled dramatically to address the current global pandemic and aid public health. However, autonomous genome annotation of SARS-CoV-2 genes, proteins, and domains is not readily accomplished by existing methods, which can yield missing or incorrect sequences. To overcome this limitation, we developed a novel semi-supervised pipeline for automated gene, protein, and functional domain annotation of SARS-CoV-2 genomes that differentiates itself by not relying on a single reference genome and by overcoming atypical genomic traits that challenge traditional bioinformatic methods. Using our method, we analyzed an initial corpus of 66,000 SARS-CoV-2 genome sequences collected from labs across the world and identified the comprehensive set of known proteins with 98.5% set membership accuracy and 99.1% accuracy in length prediction compared to proteome references, including Replicase polyprotein 1ab (with its transcriptional slippage site). Compared to other published tools, such as Prokka (base) and VAPiD, our method yielded a 6.4- and 1.8-fold increase in protein annotations, respectively. Our method generated 13,000,000 gene, protein, and domain sequences, some conserved across time and geography and others representing emerging variants. We observed 3362 non-redundant sequences per protein on average within this corpus and described the key D614G and N501Y variants spatiotemporally in the initial genome corpus. For spike glycoprotein domains, we achieved greater than 97.9% sequence identity to references and characterized receptor binding domain variants. We further demonstrated the robustness and extensibility of our method on an additional 4000 variant-diverse genomes containing all named variants of concern and interest as of August 2021. In this cohort, we successfully identified all keystone spike glycoprotein mutations in our predicted protein sequences with greater than 99% accuracy and demonstrated high accuracy of the protein and domain annotations. This work comprehensively presents the molecular targets to refine biomedical interventions for SARS-CoV-2 with a scalable, high-accuracy method to analyze newly sequenced infections as they arise.

Publisher

MDPI AG

Subject

Virology,Infectious Diseases

Link

https://www.mdpi.com/1999-4915/13/12/2426/pdf

References: 38 articles.

1. A new coronavirus associated with human respiratory disease in China

2. The Proteins of Severe Acute Respiratory Syndrome Coronavirus-2 (SARS CoV-2 or n-COV19), the Cause of COVID-19

3. The UCSC SARS-CoV-2 Genome Browser

4. Genomic determinants of pathogenicity in SARS-CoV-2 and other human coronaviruses

5. A Genomic Perspective on the Origin and Emergence of SARS-CoV-2

Cited by 6 articles.

1. Application of advanced bioimaging technologies in viral infections; Materials Today Physics; 2024-08

2. SARS-CoV-2 Next Generation Sequencing (NGS) data from clinical isolates from the East Texas Region of the United States; Data in Brief; 2023-08

3. Predicting Epitope Candidates for SARS-CoV-2; Viruses; 2022-08-21

4. Confirming Multiplex RT-qPCR Use in COVID-19 with Next-Generation Sequencing: Strategies for Epidemiological Advantage; Global Health; 2022-07-30

5. Special Issue “Emerging Viruses 2021: Surveillance, Prevention, Evolution and Control”; Viruses; 2022-04-15