ANALYSIS OF CONTEXT-DEPENDENT ERRORS FOR ILLUMINA SEQUENCING

Published:2012-04 Issue:02 Volume:10 Page:1241005
ISSN:0219-7200
Container-title:Journal of Bioinformatics and Computational Biology
Language:en
Short-container-title:J. Bioinform. Comput. Biol.

Author:

ABNIZOVA IRINA¹, LEONARD STEVEN¹, SKELLY TOM¹, BROWN ANDY¹, JACKSON DAVID¹, GOURTOVAIA MARINA¹, QI GUOYING¹, TE BOEKHORST RENE¹, FARUQUE NADEEM¹, LEWIS KEVIN¹, COX TONY¹

Affiliation:

1. Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, UK

Abstract

The new generation of short-read sequencing technologies requires reliable measures of data quality. Such measures are especially important for variant calling. However, in the particular case of SNP calling, a great number of false-positive SNPs may be obtained. One needs to distinguish putative SNPs from sequencing or other errors. We found that not only the probability of sequencing errors (i.e. the quality value) is important to distinguish an FP-SNP but also the conditional probability of "correcting" this error (the "second best call" probability, conditional on that of the first call). Surprisingly, around 80% of mismatches can be "corrected" with this second call. Another way to reduce the rate of FP-SNPs is to retrieve DNA motifs that seem to be prone to sequencing errors, and to attach a corresponding conditional quality value to these motifs. We have developed several measures to distinguish between sequence errors and candidate SNPs, based on a base call's nucleotide context and its mismatch type. In addition, we suggested a simple method to correct the majority of mismatches, based on conditional probability of their "second" best intensity call. We attach a corresponding second call confidence (quality value) of being corrected to each mismatch.

Publisher

World Scientific Pub Co Pte Ltd

Subject

Computer Science Applications,Molecular Biology,Biochemistry

Link

https://www.worldscientific.com/doi/pdf/10.1142/S0219720012410053

Reference 12 articles.

1. Alta-Cyclic: a self-optimizing base caller for next-generation sequencing

2. Nucleotide sequence of bacteriophage φX174 DNA

3. Genomic and Genetic Analysis of Bordetella Bacteriophages Encoding Reverse Transcriptase-Mediated Tropism-Switching Cassettes

4. Genome sequence of the human malaria parasite Plasmodium falciparum

Cited by 18 articles.

1. Enhancing metagenomic classification with compression-based features;Artificial Intelligence in Medicine;2024-10

2. Classifying and discovering genomic sequences in metagenomic repositories;Procedia Computer Science;2023

3. The value of compression for taxonomic identification;2022 IEEE 35th International Symposium on Computer-Based Medical Systems (CBMS);2022-07

4. Feature-Based Classification of Archaeal Sequences Using Compression-Based Methods;Pattern Recognition and Image Analysis;2022

5. ESREEM: Efficient Short Reads Error Estimation Computational Model for Next-generation Genome Sequencing;Current Bioinformatics;2021-04-30