Authors:
Fabio Cunial, Olgert Denas, Djamal Belazzougui
Abstract
Motivation: Fast, lightweight methods for comparing the sequence of ever larger assembled genomes from ever growing databases are increasingly needed in the era of accurate long reads and pan-genome initiatives. Matching statistics is a popular method for computing whole-genome phylogenies and for detecting structural rearrangements between two genomes, since it is amenable to fast implementations that require a minimal setup of data structures. However, current implementations use a single core, take too much memory to represent the result, and do not provide efficient ways to analyze the output in order to explore local similarities between the sequences.

Results: We develop practical tools for computing matching statistics between large-scale strings, and for analyzing its values, faster and using less memory than the state of the art. Specifically, we design a parallel algorithm for shared-memory machines that computes matching statistics 30 times faster with 48 cores in the cases that are most difficult to parallelize. We design a lossy compression scheme that shrinks the matching statistics array to a bitvector that takes from 0.8 to 0.2 bits per character, depending on the dataset and on the value of a threshold, and that achieves 0.04 bits per character in some variants. We also provide efficient implementations of range-maximum and range-sum queries that take a few tens of milliseconds while operating on our compact representations, and that allow computing key local statistics about the similarity between two strings. Our toolkit makes construction, storage, and analysis of matching statistics arrays practical for multiple pairs of the largest genomes available today, possibly enabling new applications in comparative genomics.

Availability and implementation: Our C/C++ code is available at https://github.com/odenas/indexed_ms under GPL-3.0.
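For readers unfamiliar with the term, the following is a minimal sketch of what a matching statistics array contains, assuming the standard definition: entry i is the length of the longest prefix of the query suffix starting at position i that occurs somewhere in the text. The function name `matching_statistics` and the example strings are illustrative assumptions. This is a naive quadratic-time version written for clarity, not the parallel, index-based algorithm or the compressed representation described in the abstract.

```python
def matching_statistics(query: str, text: str) -> list[int]:
    """Naive matching statistics of `query` with respect to `text`.

    ms[i] is the length of the longest prefix of query[i:] that occurs
    as a substring of text. Illustration only: real tools use compressed
    indexes instead of repeated substring searches.
    """
    ms = []
    length = 0
    for i in range(len(query)):
        # If query[i-1 : i-1+k] occurs in text, then query[i : i-1+k] does too,
        # so ms[i] >= ms[i-1] - 1 and we can resume from that length.
        length = max(length - 1, 0)
        # Extend the match one character at a time while it still occurs in text.
        while i + length < len(query) and query[i:i + length + 1] in text:
            length += 1
        ms.append(length)
    return ms


ms = matching_statistics("abcab", "xabcy")
print(ms)            # [3, 2, 1, 2, 1]

# Range queries over a window of the array, computed here by a plain scan;
# the paper answers them on a compact representation instead.
print(max(ms[1:4]))  # range-maximum over positions 1..3
print(sum(ms[1:4]))  # range-sum over positions 1..3
```

Long runs of large values in the array mark regions of the query that closely match the text, which is why range-maximum and range-sum queries over windows of the array are useful for exploring local similarity.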
Publisher
Cold Spring Harbor Laboratory