OMAmer: tree-driven and alignment-free protein assignment to subfamilies outperforms closest sequence approaches

Author:

Rossier Victor (1, 2, 3), Warwick Vesztrocy Alex (1, 2, 3), Robinson-Rechavi Marc (3, 4), Dessimoz Christophe (1, 2, 3, 5, 6)

Affiliation:

1. Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland

2. Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland

3. SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland

4. Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland

5. Department of Genetics, Evolution, and Environment, University College London, London, WC1E 6BT, UK

6. Department of Computer Science, University College London, London, WC1E 6BT, UK

Abstract

Motivation: Assigning new sequences to known protein families and subfamilies is a prerequisite for many functional, comparative and evolutionary genomics analyses. Such assignment is commonly achieved by looking for the closest sequence in a reference database, using a method such as BLAST. However, ignoring the gene phylogeny can be misleading because a query sequence does not necessarily belong to the same subfamily as its closest sequence. For example, a hemoglobin that branched out prior to the hemoglobin alpha/beta duplication could be closest to a hemoglobin alpha or beta sequence, whereas it is neither. To overcome this problem, phylogeny-driven tools have emerged, but they rely on gene trees, whose inference is computationally expensive.

Results: Here, we first show that in multiple animal and plant datasets, 18–62% of assignments by closest sequence are incorrect, typically pointing to an over-specific subfamily. Then, we introduce OMAmer, a novel alignment-free protein subfamily assignment method, which limits over-specific subfamily assignments and is suited to phylogenomic databases with thousands of genomes. OMAmer is based on an innovative method using evolutionarily informed k-mers for alignment-free mapping to ancestral protein subfamilies. We show that OMAmer, while able to reject non-homologous family-level assignments, provides better and quicker subfamily-level assignments than approaches relying on the closest sequence, whether inferred exactly by Smith-Waterman or by the fast heuristic DIAMOND.

Availability and implementation: OMAmer is available from the Python Package Index (as omamer), with the source code and a precomputed database available at https://github.com/DessimozLab/omamer.

Supplementary information: Supplementary data are available at Bioinformatics online.
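The hemoglobin example in the abstract can be made concrete with a small toy sketch. It is not OMAmer's algorithm or API: the sequences, the three-node hierarchy, the k-mer length, the `min_fraction` threshold and all function names are hypothetical, chosen only to show how a closest-sequence rule can pick an over-specific subfamily while a hierarchy-aware k-mer rule stops at the ancestral node. Unlike OMAmer, this toy makes no attempt to reject non-homologous queries.

```python
from collections import defaultdict

K = 3  # toy k-mer length (assumption, not OMAmer's setting)

def kmers(seq, k=K):
    """Return the set of overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

# Hypothetical toy hierarchy: node -> parent (None marks the family root).
PARENT = {"hemoglobin": None, "alpha": "hemoglobin", "beta": "hemoglobin"}

# Made-up reference sequences, one per subfamily (not real hemoglobins).
REFERENCES = {"alpha": ["MKLVWAAGHT"], "beta": ["MKLVWCCDEF"]}

def ancestors(node):
    """Path from node up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def lca(a, b):
    """Lowest common ancestor of two nodes in PARENT."""
    up = set(ancestors(a))
    return next(n for n in ancestors(b) if n in up)

def build_index():
    """Map each k-mer to the LCA of all subfamilies whose sequences contain it."""
    index = {}
    for sub, seqs in REFERENCES.items():
        for seq in seqs:
            for km in kmers(seq):
                index[km] = lca(index[km], sub) if km in index else sub
    return index

def closest_sequence(query):
    """Baseline: subfamily of the reference sharing the most k-mers with the query."""
    q = kmers(query)
    return max(REFERENCES,
               key=lambda s: max(len(q & kmers(r)) for r in REFERENCES[s]))

def tree_aware(query, index, min_fraction=0.5):
    """Descend from the root only while a child node is supported by at least
    min_fraction of the k-mers specific to that child."""
    q = kmers(query)
    node_kmers = defaultdict(set)
    for km, node in index.items():
        node_kmers[node].add(km)
    node = next(n for n, p in PARENT.items() if p is None)
    while True:
        children = [c for c, p in PARENT.items() if p == node]
        scored = [(len(q & node_kmers[c]) / len(node_kmers[c]), c)
                  for c in children if node_kmers[c]]
        if not scored:
            return node  # leaf reached
        best_fraction, best_child = max(scored)
        if best_fraction < min_fraction:
            return node  # evidence too weak: stay at the ancestral node
        node = best_child
```

With these toy inputs, the early-branching query `"MKLVWAAQRS"` shares five 3-mers with the alpha reference and three with the beta reference, so `closest_sequence` returns `"alpha"`. `tree_aware("MKLVWAAQRS", build_index())` returns `"hemoglobin"` instead, because only two of the five alpha-specific 3-mers are matched, while a genuinely alpha-like query such as `"MKLVWAAGHS"` is placed in `"alpha"`.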

Funder

Swiss National Foundation

Publisher

Oxford University Press (OUP)

Subject

Computational Mathematics, Computational Theory and Mathematics, Computer Science Applications, Molecular Biology, Biochemistry, Statistics and Probability

Cited by 7 articles.
