Abstract
Current state-of-the-art assemblers of long, error-prone reads rely on detecting all-vs-all overlaps within the set of reads, with overlaps represented by a sparse selection of short subsequences, or “seeds”. Although the quality of seed selection can affect both the accuracy and the speed of overlap detection, existing algorithms do little more than ignore over-represented seeds. Here we propose several more informed seed selection strategies to improve the precision and recall of overlaps. These strategies are evaluated on real long-read data sets across a range of fixed seed sizes. We show that these strategies substantially improve the utility of individual seeds over uninformed selection.
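To make the uninformed baseline concrete, the following is a minimal sketch of seed-based overlap candidate detection that simply ignores over-represented seeds. It is an illustration only, not the method of any particular assembler or of this paper: the function names `kmers` and `candidate_overlaps`, the cutoff parameter `max_count`, and the choice to count each seed once per read are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations


def kmers(read, k):
    """Yield every length-k substring (candidate seed) of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def candidate_overlaps(reads, k, max_count):
    """Return pairs of read indices that share at least one retained seed.

    max_count is a hypothetical cutoff: a seed found in more than
    max_count reads is treated as over-represented and ignored.
    """
    # Count in how many reads each seed occurs (once per read).
    seed_counts = Counter()
    for read in reads:
        seed_counts.update(set(kmers(read, k)))

    # Index reads by the seeds that survive the frequency filter.
    index = defaultdict(set)
    for read_id, read in enumerate(reads):
        for seed in set(kmers(read, k)):
            if seed_counts[seed] <= max_count:
                index[seed].add(read_id)

    # Any two reads sharing a retained seed form a candidate overlap.
    pairs = set()
    for read_ids in index.values():
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs
```

For example, with the toy reads `["ACGTACGTGG", "CGTGGTTTAA", "TTTAACCCCC"]`, `k = 5`, and `max_count = 2`, the first two reads are paired through the shared seed `CGTGG` and the last two through `TTTAA`. Lowering `max_count` to 1 discards both shared seeds and leaves no candidates, which shows how a purely frequency-based filter can trade recall for speed.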
Publisher
Cold Spring Harbor Laboratory