Abstract
Motivation: Sequence alignment has been at the core of computational biology for half a century. Still, it is an open problem to design a practical algorithm for exact alignment of a pair of related sequences in linear-like time (Medvedev, 2022b).

Methods: We solve exact global pairwise alignment with respect to edit distance by using the A* shortest path algorithm. In order to efficiently align long sequences with high divergence, we extend the recently proposed seed heuristic (Ivanov et al., 2022) with match chaining, gap costs, and inexact matches. We additionally integrate the novel match pruning technique and diagonal transition (Ukkonen, 1985) to improve the A* search. We prove the correctness of our algorithm, implement it in the A*PA aligner, and justify our extensions intuitively and empirically.

Results: On random sequences of divergence d=4% and length n, the empirical runtime of A*PA scales near-linearly (best fit n^1.07, n ≤ 10^7 bp). A similar scaling remains up to d=12% (best fit n^1.24, n ≤ 10^7 bp). For n = 10^7 bp and d=4%, A*PA reaches >300× speedup compared to the leading exact aligners EDLIB and BIWFA. The performance of A*PA is highly influenced by long gaps. On long (n>500 kbp) ONT reads of a human sample it efficiently aligns sequences with d<10%, leading to 2× median speedup compared to EDLIB and BIWFA. When the sequences come from different human samples, A*PA performs 1.4× faster than EDLIB and BIWFA.

Availability: github.com/RagnarGrootKoerkamp/astar-pairwise-aligner

Contact: ragnar.grootkoerkamp@inf.ethz.ch, pesho@inf.ethz.ch
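To make the Methods description concrete, the following is a minimal sketch of A* on the edit graph guided by a plain seed heuristic. It is not the A*PA implementation: it leaves out match chaining, gap costs, inexact matches, match pruning, and diagonal transition. The function names and the seed length `k` are illustrative assumptions. The heuristic used here counts the remaining non-overlapping seeds of the first sequence that have no exact occurrence anywhere in the second; each such seed must incur at least one edit, so the estimate never exceeds the true remaining cost. Because this simple form is admissible but not consistent, the sketch lets a state be re-expanded whenever a cheaper path to it is found.

```python
import heapq


def seed_heuristic_table(a, b, k):
    """Return h where h[i] is the number of seeds of `a` starting at or
    after position i that do not occur exactly anywhere in `b`."""
    kmers_b = {b[j:j + k] for j in range(len(b) - k + 1)}
    # Seeds are the non-overlapping windows a[0:k], a[k:2k], ...
    # A trailing window shorter than k is ignored.
    unmatched_starts = {
        s * k
        for s in range(len(a) // k)
        if a[s * k:(s + 1) * k] not in kmers_b
    }
    h = [0] * (len(a) + 1)
    for i in range(len(a) - 1, -1, -1):
        h[i] = h[i + 1] + (1 if i in unmatched_starts else 0)
    return h


def astar_edit_distance(a, b, k=3):
    """Exact edit distance between `a` and `b` via A* on the edit graph.

    States are (i, j): i characters of `a` and j characters of `b` consumed.
    """
    n, m = len(a), len(b)
    h = seed_heuristic_table(a, b, k)
    best = {(0, 0): 0}          # cheapest known cost g to reach each state
    heap = [(h[0], 0, 0, 0)]    # entries are (f = g + h, g, i, j)
    while heap:
        _, g, i, j = heapq.heappop(heap)
        if g > best.get((i, j), float("inf")):
            continue            # stale entry: a cheaper path was found later
        if i == n and j == m:
            return g
        steps = []
        if i < n and j < m:     # match (cost 0) or substitution (cost 1)
            steps.append((i + 1, j + 1, 0 if a[i] == b[j] else 1))
        if i < n:               # delete a[i]
            steps.append((i + 1, j, 1))
        if j < m:               # insert b[j]
            steps.append((i, j + 1, 1))
        for ni, nj, cost in steps:
            ng = g + cost
            if ng < best.get((ni, nj), float("inf")):
                best[(ni, nj)] = ng
                heapq.heappush(heap, (ng + h[ni], ng, ni, nj))
    raise RuntimeError("target state unreachable")  # not expected for strings
```

For example, `astar_edit_distance("kitten", "sitting")` evaluates to 3. With closely related sequences most seeds have matches, so the heuristic mainly penalizes regions containing errors and steers the search away from states that cannot lie on a cheap path.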
Publisher
Cold Spring Harbor Laboratory
Cited by
2 articles.