Improved selection of canonical proteins for reference proteomes

Published:2024-04-04 Issue:2 Volume:6
ISSN:2631-9268
Container-title:NAR Genomics and Bioinformatics
language:en

Author:

Insana Giuseppe¹, Martin Maria J¹, Pearson William R²

Affiliation:

1. European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK

2. Department of Biochemistry and Molecular Genetics, University of Virginia School of Medicine, Charlottesville, VA 22908, USA

Abstract

The ‘canonical’ protein sets distributed by UniProt are widely used for similarity searching, and functional and structural annotation. For many investigators, canonical sequences are the only version of a protein examined. However, higher eukaryotes often encode multiple isoforms of a protein from a single gene. For unreviewed (UniProtKB/TrEMBL) protein sequences, the longest sequence in a Gene-Centric group is chosen as canonical. This choice can create inconsistencies, selecting >95% identical orthologs with dramatically different lengths, which is biologically unlikely. We describe the ortho2tree pipeline, which examines Reference Proteome canonical and isoform sequences from sets of orthologous proteins, builds multiple alignments, constructs gap-distance trees, and identifies low-cost clades of isoforms with similar lengths. After examining 140 000 proteins from eight mammals in UniProtKB release 2022_05, ortho2tree proposed 7804 canonical changes for release 2023_01, while confirming 53 434 canonicals. Gap distributions for isoforms selected by ortho2tree are similar to those in bacterial and yeast alignments, organisms unaffected by isoform selection, suggesting ortho2tree canonicals more accurately reflect genuine biological variation. 82% of ortho2tree proposed changes agreed with MANE; for confirmed canonicals, 92% agreed with MANE. Ortho2tree can improve canonical assignment among orthologous sequences that are >60% identical, a group that includes vertebrates and higher plants.
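The idea in the abstract, preferring a set of orthologous isoforms whose alignment needs few gaps over simply taking the longest isoform per gene, can be illustrated with a toy sketch. This is not the ortho2tree implementation: the gap-distance definition used here (alignment columns where exactly one of two sequences has a gap), the function names, the sequences and the accession-like labels are all assumptions made for illustration, and a brute-force search over isoform combinations stands in for the tree construction and clade search the abstract describes.

```python
from itertools import combinations, product


def gap_distance(a: str, b: str) -> int:
    """Count alignment columns where exactly one of the two sequences has a gap.

    Assumed definition for this sketch; the paper's exact metric may differ.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum((x == "-") != (y == "-") for x, y in zip(a, b))


def pick_low_cost_isoforms(aligned: dict[str, dict[str, str]]) -> tuple[dict[str, str], int]:
    """Pick one isoform per organism, minimizing the summed pairwise gap distance.

    `aligned` maps organism -> {isoform label -> aligned sequence}.
    Brute force over all combinations: fine for a toy, not for real data.
    """
    organisms = sorted(aligned)
    best_choice, best_cost = None, None
    for combo in product(*(sorted(aligned[o]) for o in organisms)):
        seqs = [aligned[o][iso] for o, iso in zip(organisms, combo)]
        cost = sum(gap_distance(s, t) for s, t in combinations(seqs, 2))
        if best_cost is None or cost < best_cost:
            best_choice, best_cost = dict(zip(organisms, combo)), cost
    return best_choice, best_cost


# Hypothetical pre-aligned isoforms (made-up sequences and labels).
toy = {
    "human": {"H_long": "MKTAYIAKQRQISFVK", "H_short": "MKTAYIAK----SFVK"},
    "mouse": {"M_iso1": "MKTAYIAK----SFVK"},
    "rat": {"R_iso1": "MKSAYIAK----SFVR", "R_iso2": "MKSAYIAKQRQISFVR"},
}

choice, cost = pick_low_cost_isoforms(toy)
print(choice, cost)
```

In this toy input, a longest-sequence rule would select `H_long` and `R_iso2`, which together with the single mouse isoform have a summed gap distance of 8. The sketch instead selects `H_short`, `M_iso1` and `R_iso1`, which align with no gap differences (cost 0).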

Funder

National Science Foundation

European Molecular Biology Laboratory

Publisher

Oxford University Press (OUP)

Link

https://academic.oup.com/nargab/article-pdf/6/2/lqae066/58200385/lqae066.pdf
