PanPA: generation and alignment of panproteome graphs

Author:

Dabbaghie Fawaz (1, 2, 3), Srikakulam Sanjay K (3, 4, 5), Marschall Tobias (1, 2), Kalinina Olga V (3, 6, 7)

Affiliation:

1. Institute for Medical Biometry and Bioinformatics, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, 40225 Düsseldorf, Germany

2. Center for Digital Medicine, Heinrich Heine University, 40225 Düsseldorf, Germany

3. Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Center for Infection Research (HZI), Saarbrücken, Germany

4. Graduate School of Computer Science, Saarland University, 66123 Saarbrücken, Germany

5. Interdisciplinary Graduate School of Natural Product Research, Saarland University, 66123 Saarbrücken, Germany

6. Drug Bioinformatics, Medical Faculty, Saarland University, 66421 Homburg, Germany

7. Center for Bioinformatics, Saarland University, 66123 Saarbrücken, Germany

Abstract

Motivation: Compared to eukaryotes, prokaryote genomes are more diverse through different mechanisms, including a higher mutation rate and horizontal gene transfer. Therefore, using a linear representative reference can cause a reference bias. Graph-based pangenome methods have been developed to tackle this problem. However, comparisons in DNA space are still challenging due to this high diversity. In contrast, amino acid sequences have higher similarity due to evolutionary constraints, and because a single amino acid may be encoded by several synonymous codons. Coding regions cover the majority of the genome in prokaryotes. Thus, panproteomes present an attractive alternative, leveraging the higher sequence similarity while not losing much of the genome in non-coding regions.

Results: We present PanPA, a method that takes a set of multiple sequence alignments of protein sequences, indexes them, and builds a graph for each multiple sequence alignment. In the querying step, it can align DNA or amino acid sequences back to these graphs. We first showcase that PanPA generates correct alignments on a panproteome from 1350 Escherichia coli. To demonstrate that panproteomes allow comparisons at longer phylogenetic distances, we compare DNA and protein alignments from 1073 Salmonella enterica assemblies against the E. coli reference genome, pangenome, and panproteome using BWA, GraphAligner, and PanPA, respectively, with PanPA aligning around 22% more sequences. We also aligned a DNA short-read whole-genome sequencing (WGS) sample from S. enterica against the E. coli reference with BWA and against the panproteome with PanPA, where PanPA was able to find alignments for 68% of the reads compared to 5% with BWA.

Availability and implementation: PanPA is available at https://github.com/fawaz-dabbaghieh/PanPA.
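The Motivation's point about synonymous codons can be illustrated with a small sketch. It is not part of PanPA: the two DNA fragments are invented for the example, the codon table is a deliberately partial excerpt of the standard genetic code, and `translate` and `identity` are hypothetical helper names. It shows two coding sequences that differ at several nucleotide positions yet translate to the same peptide.

```python
# Partial excerpt of the standard genetic code (only the codons used below).
CODON_TABLE = {
    "ATG": "M",
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "CTT": "L", "CTC": "L", "CTA": "L", "CTG": "L", "TTA": "L", "TTG": "L",
    "AAA": "K", "AAG": "K",
    "GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",
}


def translate(dna: str) -> str:
    """Translate a DNA string in frame 0, ignoring a trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Invented example fragments: same peptide, different synonymous codons.
dna_a = "ATGGCTCTTAAAGGT"
dna_b = "ATGGCCTTGAAGGGC"

dna_identity = identity(dna_a, dna_b)
protein_identity = identity(translate(dna_a), translate(dna_b))

print(f"DNA identity:     {dna_identity:.2f}")
print(f"Protein identity: {protein_identity:.2f}")
```

In this toy case the nucleotide sequences differ at a third of their positions while the translated peptides are identical, which is the effect the abstract relies on when comparing more distant genomes in protein space.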

Funder

Ministry of Culture and Science of the State of North Rhine-Westphalia

Klaus Faber Foundation

Publisher

Oxford University Press (OUP)

Subject

Computer Science Applications, Genetics, Molecular Biology, Structural Biology
