RefSeq and the prokaryotic genome annotation pipeline in the age of metagenomes

Published:2023-11-14 Issue:D1 Volume:52 Page:D762-D769
ISSN:0305-1048
Container-title:Nucleic Acids Research
language:en

Author:

Haft Daniel H¹,Badretdin Azat¹,Coulouris George¹,DiCuccio Michael¹,Durkin A Scott¹,Jovenitti Eric¹,Li Wenjun¹,Mersha Megdelawit¹,O’Neill Kathleen R¹,Virothaisakun Joel¹,Thibaud-Nissen Françoise¹

Affiliation:

1. National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA

Abstract

The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains over 315 000 bacterial and archaeal genomes and 236 million proteins with up-to-date and consistent annotation. In the past 3 years, we have expanded the diversity of the RefSeq collection by including the best-quality metagenome-assembled genomes (MAGs) submitted to INSDC (DDBJ, ENA and GenBank), while maintaining its quality by adding validation checks. Assemblies are now more stringently evaluated for contamination and for completeness of annotation prior to acceptance into RefSeq. MAGs now account for over 17 000 assemblies in RefSeq, split over 165 orders and 362 families. Changes in the Prokaryotic Genome Annotation Pipeline (PGAP), which is used to annotate nearly all RefSeq assemblies, include better detection of protein-coding genes. Nearly 83% of RefSeq proteins are now named by a curated Protein Family Model, a 4.7% increase over the past 3 years. In addition to literature citations, Enzyme Commission numbers, and gene symbols, Gene Ontology terms are now assigned to 48% of RefSeq proteins, allowing for easier multi-genome comparison. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/. PGAP is available as a stand-alone tool able to produce GenBank-ready files at https://github.com/ncbi/pgap.

Funder

National Library of Medicine

National Institutes of Health

Publisher

Oxford University Press (OUP)

Subject

Genetics

Link

https://academic.oup.com/nar/article-pdf/52/D1/D762/55041263/gkad988.pdf

References (40 articles)

1. The international nucleotide sequence database collaboration;Arita;Nucleic Acids Res.,2021

2. The European Nucleotide Archive in 2022;Burgin;Nucleic Acids Res.,2023

3. DNA Data Bank of Japan (DDBJ) update report 2022;Tanizawa;Nucleic Acids Res.,2023

4. GenBank;Sayers;Nucleic Acids Res.,2022

5. Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life;Parks;Nat. Microbiol.,2017

Cited by 2 articles.

1. The second messenger c-di-AMP controls natural competence via ComFB signaling protein;2023-11-27

2. Database resources of the National Center for Biotechnology Information;Nucleic Acids Research;2023-11-22