SeqWho: Reliable, rapid determination of sequence file identity using k-mer frequencies

Published: 2021-03-11

Authors:

Bennett Christopher, Thornton Micah, Park Chanhee, Henry Gervaise, Zhang Yun, Malladi Venkat S., Kim Daehwan

Abstract

With the vast improvements in sequencing technologies and increased number of protocols, sequencing is finding more applications to answer complex biological problems. Thus, the amount of publicly available sequencing data has increased tremendously in repositories such as SRA, EGA, and ENCODE. With any large online database, there is a critical need to accurately document study metadata, such as the source protocol and organism. In some cases, this metadata may not be systematically verified by the hosting sites and may result in a negative influence on future studies. Here we present SeqWho, a program designed to heuristically assess the quality of sequencing files and reliably classify the organism and protocol type. This is done in an alignment-free algorithm that leverages a Random Forest classifier to learn from native biases in k-mer frequencies and repeat sequence identities between different sequencing technologies and species. Here, we show that our method can accurately and rapidly distinguish between human and mouse, nine different sequencing technologies, and both together, 98.32%, 97.86%, and 96.38% of the time in high-confidence calls, respectively. This demonstrates that SeqWho is a powerful method for reliably checking the identity of the sequencing files used in any pipeline and illustrates the program’s ability to leverage k-mer biases.
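
The kind of alignment-free feature the abstract refers to can be illustrated with a minimal sketch of k-mer frequency counting. This is not SeqWho's implementation: the function name `kmer_frequencies`, the choice of k, the handling of ambiguous bases, and the toy reads are all assumptions made for illustration. A classifier such as a Random Forest would take vectors like this one as input; that step is not shown here.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(reads, k=2):
    """Return normalized k-mer frequencies over A/C/G/T for an iterable of reads.

    Hypothetical helper for illustration only; not part of SeqWho.
    """
    alphabet = "ACGT"
    # Fixed ordering so every input maps to a vector of the same length (4**k).
    all_kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Assumption: windows containing ambiguous bases (e.g. N) are skipped.
            if all(base in alphabet for base in kmer):
                counts[kmer] += 1
    total = sum(counts.values())
    return [counts[km] / total if total else 0.0 for km in all_kmers]


# Toy reads, not real sequencing data.
reads = ["ACGTACGT", "ACGNNACG"]
vector = kmer_frequencies(reads, k=2)
```

Because the vector has a fixed length and ordering, frequency profiles from different files are directly comparable, which is what lets a downstream classifier learn species- or protocol-specific biases from them.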

Publisher

Cold Spring Harbor Laboratory
