Abstract
AbstractMotivationPangenome variation graphs model the mutual alignment of collections of DNA sequences. A set of pairwise alignments implies a variation graph, but there are no scalable methods to generate such a graph from these alignments. Existing related approaches depend on a single reference, a specific ordering of genomes, or a de Bruijn model based on a fixed k-mer length. A scalable, self-contained method to build pangenome graphs without such limitations would be a key step in pangenome construction and manipulation pipelines.ResultsWe design the seqwish algorithm, which builds a variation graph from a set of sequences and alignments between them. We first transform the alignment set into an implicit interval tree. To build up the variation graph, we query this tree-based representation of the alignments to reduce transitive matches into single DNA segments in a sequence graph. By recording the mapping from input sequence to output graph, we can trace the original paths through this graph, yielding a pangenome variation graph. We present an implementation that operates in external memory, using disk-backed data structures and lock-free parallel methods to drive the core graph induction step. We demonstrate that our method scales to very large graph induction problems by applying it to build pangenome graphs for several species.Availabilityseqwish is published as free software under the MIT open source license. Source code and documentation are available at https://github.com/ekg/seqwish. seqwish can be installed via Bioconda https://bioconda.github.io/recipes/seqwish/README.html or GNU Guix https://github.com/ekg/guix-genomics/blob/master/seqwish.scm.Contactegarris5@uthsc.edu
Publisher
Cold Spring Harbor Laboratory
Reference39 articles.
1. Anderson, R. J. and Woll, H. (1991). Wait-free parallel algorithms for the union-find problem. In Proceedings of the twenty-third annual ACM symposium on Theory of computing, pages 370–380.
2. Progressive Cactus is a multiple-genome aligner for the thousand-genome era
3. In-Place Parallel Super Scalar Samplesort (IPSSSSo);25th Annual European Symposium on Algorithms (ESA 2017), volume 87 of Leibniz International Proceedings in Informatics (LIPIcs),2017
4. Computational pan-genomics: status, promises and challenges
5. Doerr, D. (2021 (accessed Jan 2022)). Gfaffix identifies walk-preserving shared affixes in variation graphs and collapses them into a non-redundant graph structure. https://github.com/marschall-lab/GFAffix.
Cited by
17 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献