Genome Assembly, from Practice to Theory: Safe, Complete and Linear-Time


Massimo Cairo (1), Romeo Rizzi (2), Alexandru I. Tomescu (1), Elia C. Zirondelli (3)


1. Department of Computer Science, University of Helsinki, Finland

2. Department of Computer Science, University of Verona, Italy

3. Department of Mathematics, University of Trento, Italy and Department of Computer Science, University of Verona, Italy


Genome assembly asks to reconstruct an unknown string from many shorter substrings of it. Even though it is one of the key problems in Bioinformatics, it has generally lacked major theoretical advances. Its hardness stems both from practical issues (the size of real data and the errors in it) and from the fact that problem formulations inherently admit multiple solutions. For these reasons, most state-of-the-art assemblers are, at their core, based on finding non-branching paths (unitigs) in an assembly graph. While such paths constitute only partial assemblies, they are likely to be correct. More precisely, if one defines a genome assembly solution as a closed arc-covering walk of the graph, then unitigs appear in all solutions and are thus safe partial solutions. Until recently, it was open which walks of an assembly graph are safe. Tomescu and Medvedev (RECOMB 2016) characterized all such safe walks (omnitigs), thus giving the first safe and complete genome assembly algorithm. Even though maximal omnitig finding was later improved to quadratic time by Cairo et al. (ACM Trans. Algorithms 2019), it remained open whether the crucial linear-time feature of finding unitigs can be attained with omnitigs. We answer this question affirmatively, by describing a surprising O(m)-time algorithm to identify all maximal omnitigs of a graph with n nodes and m arcs, notwithstanding the existence of families of graphs with Θ(mn) total maximal omnitig size. This is based on the discovery of a family of walks (macrotigs) with the property that all the non-trivial omnitigs are univocal extensions of subwalks of a macrotig. This has two consequences: (1) a linear-time output-sensitive algorithm enumerating all maximal omnitigs; (2) a compact O(m) representation of all maximal omnitigs, which allows, e.g., for O(m)-time computation of various statistics on them.
Our results close a long-standing theoretical question inspired by practical genome assemblers, originating with the use of unitigs in 1995. We envision our results to be at the core of a reverse transfer from theory to practical and complete genome assembly programs, as has been the case for other key Bioinformatics problems.
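
As an illustration of the linear-time baseline that the abstract compares against, the following is a minimal sketch of maximal unitig extraction. It is not the omnitig or macrotig algorithm of the paper. It assumes the assembly graph is given as a list of directed arcs, and that a unitig is a walk whose internal nodes all have in-degree 1 and out-degree 1. The function name `maximal_unitigs` and the input format are hypothetical, chosen only for this example.

```python
from collections import defaultdict


def maximal_unitigs(arcs):
    """Return the maximal non-branching walks of a directed graph.

    `arcs` is a list of (tail, head) pairs (an assumed input format).
    Each walk is returned as a list of nodes. Every arc is traversed a
    constant number of times, so the running time is linear in len(arcs).
    """
    out_nbrs = defaultdict(list)
    in_deg = defaultdict(int)
    for u, v in arcs:
        out_nbrs[u].append(v)
        in_deg[v] += 1

    def is_internal(x):
        # A node may sit in the interior of a unitig only if it has
        # exactly one incoming and exactly one outgoing arc.
        return in_deg[x] == 1 and len(out_nbrs[x]) == 1

    nodes = list(out_nbrs)  # snapshot: nodes with at least one out-arc
    visited = set()         # internal nodes already placed in a walk
    unitigs = []

    # Start one walk per arc leaving a branching (non-internal) node and
    # extend it for as long as the current head is an internal node.
    for u in nodes:
        if is_internal(u):
            continue
        for v in out_nbrs[u]:
            walk = [u, v]
            while is_internal(v):
                visited.add(v)
                v = out_nbrs[v][0]
                walk.append(v)
            unitigs.append(walk)

    # Internal nodes not reached above lie on isolated cycles in which
    # every node is internal; report each such cycle once.
    for u in nodes:
        if is_internal(u) and u not in visited:
            walk = [u]
            visited.add(u)
            v = out_nbrs[u][0]
            while v != u:
                visited.add(v)
                walk.append(v)
                v = out_nbrs[v][0]
            walk.append(u)
            unitigs.append(walk)

    return unitigs


if __name__ == "__main__":
    example = [("a", "b"), ("b", "c"), ("c", "d"),
               ("c", "e"), ("d", "a"), ("e", "a")]
    for walk in maximal_unitigs(example):
        print(" -> ".join(walk))
```

In the example graph, node `c` branches and node `a` merges, so the sketch reports three unitigs: `a -> b -> c`, `c -> d -> a`, and `c -> e -> a`. Omnitigs, as characterized in the cited work, can be longer than these walks, which is why finding all of them in O(m) time is the harder question.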


European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme

Academy of Finland


Association for Computing Machinery (ACM)


Mathematics (miscellaneous)







