SparkEC: speeding up alignment-based DNA error correction tools

Published:2022-11-07 Issue:1 Volume:23 Page:
ISSN:1471-2105
Container-title:BMC Bioinformatics
language:en
Short-container-title:BMC Bioinformatics

Author:

Expósito Roberto R., Martínez-Sánchez Marco, Touriño Juan

Abstract

Background: In recent years, huge improvements have been made in sequencing genomic data under what is called Next Generation Sequencing (NGS). However, the DNA reads generated by current NGS platforms are not free of errors, which can affect the quality of downstream analysis. Although error correction can be performed as a preprocessing step to overcome this issue, it usually requires long computational times to analyze the large datasets generated nowadays through NGS. Therefore, new software capable of scaling out on a cluster of nodes with high performance is of great importance.

Results: In this paper, we present SparkEC, a parallel tool capable of fixing the errors produced during the sequencing process. For this purpose, the algorithms proposed by the CloudEC tool, which has already been proven to perform accurate corrections, have been analyzed and optimized to improve their performance by relying on the Apache Spark framework, together with other enhancements such as the use of memory-efficient data structures and the avoidance of any input preprocessing. The experimental results have shown significant improvements in the computational times of SparkEC when compared to CloudEC for all the representative datasets and scenarios under evaluation, providing average and maximum speedups of 4.9× and 11.9×, respectively, over its counterpart.

Conclusion: As error correction can take excessive computational time, SparkEC provides a scalable solution for correcting large datasets. Due to its distributed implementation, SparkEC's speed can increase with the number of nodes in a cluster. Furthermore, the software is freely available under the GPLv3 license and is compatible with different operating systems (Linux, Windows and macOS).

Funder

Ministerio de Ciencia e Innovación

Consellería de Cultura, Educación e Ordenación Universitaria, Xunta de Galicia

Publisher

Springer Science and Business Media LLC

Subject

Applied Mathematics,Computer Science Applications,Molecular Biology,Biochemistry,Structural Biology

Link

https://link.springer.com/content/pdf/10.1186/s12859-022-05013-1.pdf

Reference: 36 articles.

1. van Dijk EL, Auger H, Jaszczyszyn Y, Thermes C. Ten years of next-generation sequencing technology. Trends Genet. 2014;30(9):418–26.

2. Alic AS, Ruzafa D, Dopazo J, Blanquer I. Objective review of de novo stand-alone error correction methods for NGS data. WIREs Comput Mol Sci. 2016;6(2):111–46.

3. Heydari M, Miclotte G, Demeester P, Van de Peer Y, Fostier J. Evaluation of the impact of Illumina error correction tools on de novo genome assembly. BMC Bioinform. 2017;18(1):374.

4. Chung W, Ho J, Lin C, Lee DT. CloudEC: a MapReduce-based algorithm for correcting errors in NGS data. [Online]. https://github.com/CSCLabTW/CloudEC. Accessed 15 Sept 2022.

5. Lämmel R. Google’s MapReduce programming model-Revisited. Sci Comput Program. 2008;70(1):1–30.

Cited by 2 articles.

1. Integration of hybrid and self-correction method improves the quality of long-read sequencing data;Briefings in Functional Genomics;2023-06-20

2. Framing Apache Spark in life sciences;Heliyon;2023-02