CAREx: context-aware read extension of paired-end sequencing data
-
Published:2024-05-10
Issue:1
Volume:25
Page:
-
ISSN:1471-2105
-
Container-title:BMC Bioinformatics
-
language:en
-
Short-container-title:BMC Bioinformatics
Author:
Kallenborn FelixORCID, Schmidt BertilORCID
Abstract
Abstract
Background
Commonly used next generation sequencing machines typically produce large amounts of short reads of a few hundred base-pairs in length. However, many downstream applications would generally benefit from longer reads.
Results
We present CAREx—an algorithm for the generation of pseudo-long reads from paired-end short-read Illumina data based on the concept of repeatedly computing multiple-sequence-alignments to extend a read until its partner is found. Our performance evaluation on both simulated data and real data shows that CAREx is able to connect significantly more read pairs (up to $$99\%$$
99
%
for simulated data) and to produce more error-free pseudo-long reads than previous approaches. When used prior to assembly it can achieve superior de novo assembly results. Furthermore, the GPU-accelerated version of CAREx exhibits the fastest execution times among all tested tools.
Conclusion
CAREx is a new MSA-based algorithm and software for producing pseudo-long reads from paired-end short read data. It outperforms other state-of-the-art programs in terms of (i) percentage of connected read pairs, (ii) reduction of error rates of filled gaps, (iii) runtime, and (iv) downstream analysis using de novo assembly. CAREx is open-source software written in C++ (CPU version) and in CUDA/C++ (GPU version). It is licensed under GPLv3 and can be downloaded at (https://github.com/fkallen/CAREx).
Funder
Johannes Gutenberg-Universität Mainz
Publisher
Springer Science and Business Media LLC
Reference30 articles.
1. Byrska-Bishop M, Evani US, Zhao X, Basile AO, Abel HJ, Regier AA, Corvelo A, Clarke WE, Musunuri R, Nagulapalli K, Fairley S, Runnels A, Winterkorn L, Lowy E, Eichler EE, Korbel JO, Lee C, Marschall T, Devine SE, Harvey WT, Zhou W, Mills RE, Rausch T, Kumar S, Alkan C, Hormozdiari F, Chong Z, Chen Y, Yang X, Lin J, Gerstein MB, Kai Y, Zhu Q, Yilmaz F, Xiao C, Flicek Paul, Germer S, Brand H, Hall IM, Talkowski ME, Narzisi G, Zody MC. High-coverage whole-genome sequencing of the expanded 1000 genomes project cohort including 602 trios. Cell. 2022;185(18):3426–344019. https://doi.org/10.1016/j.cell.2022.08.004. 2. Hammond SA, Warren RL, Vandervalk BP, Kucuk E, Khan H, Gibb EA, Pandoh P, Kirk H, Zhao Y, Jones M, et al. The North American bullfrog draft genome provides insight into hormonal regulation of long noncoding RNA. Nat Commun. 2017;8(1):1–8. 3. Sim M, Lee J, Wy S, Park N, Lee D, Kwon D, Kim J. Generation and application of pseudo-long reads for metagenome assembly. GigaScience. 2022;11:giac044. 4. Vandervalk BP, Yang C, Xue Z, Raghavan K, Chu J, Mohamadi H, Jackman SD, Chiu R, Warren RL, Birol I. Konnector v2.0: pseudo-long reads from paired-end sequencing data. BMC Med Genomics. 2015;8(3):1. 5. Magoč T, Salzberg SL. FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics. 2011;27(21):2957–63. https://doi.org/10.1093/bioinformatics/btr507.
|
|