RefSeq: expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation

Published: 2020-12-03 Issue: D1 Volume: 49 Page: D1020-D1028
ISSN: 0305-1048
Container-title: Nucleic Acids Research
Language: en

Author:

Li Wenjun¹, O’Neill Kathleen R¹, Haft Daniel H¹, DiCuccio Michael¹, Chetvernin Vyacheslav¹, Badretdin Azat¹, Coulouris George¹, Chitsaz Farideh¹, Derbyshire Myra K¹, Durkin A Scott¹, Gonzales Noreen R¹, Gwadz Marc¹, Lanczycki Christopher J¹, Song James S¹, Thanki Narmada¹, Wang Jiyao¹, Yamashita Roxanne A¹, Yang Mingzhang¹, Zheng Chanjuan¹, Marchler-Bauer Aron¹, Thibaud-Nissen Françoise¹

Affiliation:

1. National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 45 Center Drive, Bethesda, MD 20892-6511, USA

Abstract

The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains nearly 200 000 bacterial and archaeal genomes and 150 million proteins with up-to-date annotation. Changes in the Prokaryotic Genome Annotation Pipeline (PGAP) since 2018 have resulted in a substantial reduction in spurious annotation. The hierarchical collection of protein family models (PFMs) used by PGAP as evidence for structural and functional annotation was expanded to over 35 000 protein profile hidden Markov models (HMMs), 12 300 BlastRules and 36 000 curated CDD architectures. As a result, >122 million or 79% of RefSeq proteins are now named based on a match to a curated PFM. Gene symbols, Enzyme Commission numbers or supporting publication attributes are available on over 40% of the PFMs and are inherited by the proteins and features they name, facilitating multi-genome analyses and connections to the literature. In adherence with the principles of FAIR (findable, accessible, interoperable, reusable), the PFMs are available in the Protein Family Models Entrez database to any user. Finally, the reference and representative genome set, a taxonomically diverse subset of RefSeq prokaryotic genomes, is now recalculated regularly and available for download and homology searches with BLAST. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/.

Funder

National Institutes of Health

Publisher

Oxford University Press (OUP)

Subject

Genetics

Link

http://academic.oup.com/nar/article-pdf/49/D1/D1020/35364279/gkaa1105.pdf
