Error analysis of the PacBio sequencing CCS reads

Published: 2023-05-08 Issue: 2 Volume: 19 Page: 439-453
ISSN: 1557-4679
Container-title: The International Journal of Biostatistics
Language: en

Author:

Pourmohammadi Reza¹, Abouei Jamshid¹, Anpalagan Alagan²

Affiliation:

1. WINEL Research Laboratory at the Department of Electrical Engineering, Yazd University, Yazd, Iran

2. Department of Electrical, Computer and Biomedical Engineering, Ryerson University, Toronto, Canada

Abstract

Third-generation sequencing technologies such as Pacific Biosciences (PacBio) and Oxford Nanopore generate longer reads than next-generation sequencing, making the assembly process faster, simpler, and more cost-effective. However, the error rates of these long reads are higher than those of short reads, so an error correction step is needed before assembly, for example using the Circular Consensus Sequencing (CCS) reads of PacBio sequencing machines. In this paper, we propose a probabilistic model for the occurrence of errors along CCS reads. We obtain the error probability of an arbitrary nucleotide, as well as the base-calling Phred quality score of the nucleotides along the CCS reads, in terms of the number of sub-reads. Furthermore, we derive the error rate distribution of the reads in relation to the pass number: it follows a binomial distribution, which can be approximated by a normal distribution for long reads. Finally, we evaluate the proposed model by comparing it with three real PacBio datasets, namely the Lambda and E. coli genomes and an Alzheimer’s disease targeted experiment.
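The quantities named in the abstract can be illustrated with a minimal sketch. It is not the authors' model: the majority-vote consensus rule, the independence of errors across passes, and all function names and parameter values below are assumptions made only for illustration. The Phred definition Q = -10 log10(p) and the normal approximation to the binomial (mean Lp, variance Lp(1-p)) are standard.

```python
import math


def phred(p_error):
    """Standard Phred quality score for an error probability p_error."""
    return -10.0 * math.log10(p_error)


def majority_vote_error(e, n):
    """Toy consensus error for n passes (assumption, not the paper's model).

    Assumes each pass miscalls the base independently with probability e
    and the consensus is wrong when more than half of the passes are wrong.
    """
    k_min = n // 2 + 1
    return sum(
        math.comb(n, k) * e**k * (1 - e) ** (n - k) for k in range(k_min, n + 1)
    )


def binomial_pmf(k, length, p):
    """Probability of exactly k erroneous bases in a read of given length."""
    return math.comb(length, k) * p**k * (1 - p) ** (length - k)


def normal_pdf(x, mean, var):
    """Normal density used to approximate the binomial for long reads."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


if __name__ == "__main__":
    e = 0.1  # hypothetical single-pass error probability
    for n in (1, 3, 5, 7):
        p = majority_vote_error(e, n)
        print(f"passes={n}  consensus error={p:.6f}  Phred={phred(p):.1f}")

    length, p = 1000, 0.01  # hypothetical read length and per-base error
    mean, var = length * p, length * p * (1 - p)
    print(binomial_pmf(10, length, p), normal_pdf(10, mean, var))
```

Under these assumptions the consensus error falls, and the Phred score rises, as the number of passes grows, and the binomial and normal values at the mean agree closely for a read of 1000 bases.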

Publisher

Walter de Gruyter GmbH

Subject

Statistics, Probability and Uncertainty; General Medicine; Statistics and Probability

Link

https://www.degruyter.com/document/doi/10.1515/ijb-2021-0091/pdf

References: 27 articles.

1. Pourmohammadi, R, Abouei, J, Anpalagan, A. Probabilistic modeling and analysis of DNA fragmentation. J Biol Syst 2019;27:281–307. https://doi.org/10.1142/s0218339019500128.

2. van Dijk, EL, Jaszczyszyn, Y, Naquin, D, Thermes, C. The third revolution in sequencing technology. Trends Genet 2018;34:666–81. https://doi.org/10.1016/j.tig.2018.05.008.

3. Johnson, SS, Zaikova, E, Goerlitz, DS, Bai, Y, Tighe, SW. Real-time DNA sequencing in the Antarctic dry valleys using the Oxford Nanopore sequencer. J Biomol Tech 2017;28:2–7. https://doi.org/10.7171/jbt.17-2801-009.

4. Jiao, X, Zheng, X, Ma, L, Kutty, G, Gogineni, E, Sun, Q, et al. A benchmark study on error assessment and quality control of CCS reads derived from the PacBio RS. J Data Min Genom Proteomics 2013;4:1–5. https://doi.org/10.4172/2153-0602.1000136.

5. Koren, S, Schatz, MC, Walenz, BP, Martin, J, Howard, JT, Ganapathy, G, et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotechnol 2012;30:693–700. https://doi.org/10.1038/nbt.2280.

Cited by 1 article.

1. IPEV: identification of prokaryotic and eukaryotic virus-derived sequences in virome using deep learning. GigaScience 2024.