SeqWho: reliable, rapid determination of sequence file identity using k-mer frequencies in Random Forest classifiers

Published:2022-02-03 Issue:7 Volume:38 Page:1830-1837
ISSN:1367-4803
Container-title:Bioinformatics
language:en

Author:

Bennett Christopher¹, Thornton Micah¹, Park Chanhee¹, Henry Gervaise², Zhang Yun¹, Malladi Venkat¹, Kim Daehwan¹

Affiliation:

1. Lyda Hill Department of Bioinformatics, University of Texas Southwestern, Dallas, TX 75390, USA

2. Department of Urology, University of Texas Southwestern, Dallas, TX 75390, USA

Abstract

Motivation: With the vast improvements in sequencing technologies and increased number of protocols, sequencing is being used to answer complex biological problems. Subsequently, analysis pipelines have become more time consuming and complicated, usually requiring highly extensive prevalidation steps. Here, we present SeqWho, a program designed to assess heuristically the quality of sequencing files and reliably classify the organism and protocol type by using Random Forest classifiers trained on biases native in k-mer frequencies and repeat sequence identities.

Results: Using one of our primary models, we show that our method accurately and rapidly classifies human and mouse sequences from nine different sequencing libraries by species, library and both together, 98.32%, 97.86% and 96.38% of the time, respectively. Ultimately, we demonstrate that SeqWho is a powerful method for reliably validating the quality and identity of the sequencing files used in any pipeline.

Availability and implementation: https://github.com/DaehwanKimLab/seqwho

Supplementary information: Supplementary data are available at Bioinformatics online.
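The abstract names k-mer frequencies as the features behind the classifiers. As a rough illustration only, and not SeqWho's actual implementation, the sketch below turns a handful of reads into a fixed-length vector of normalized k-mer frequencies. The function name `kmer_frequencies`, the choice of k = 2, and the example reads are all hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(reads, k=2):
    """Return normalized k-mer frequencies over A/C/G/T for a list of reads.

    Hypothetical helper for illustration; not taken from the SeqWho code base.
    """
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip k-mers that contain ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    total = sum(counts.values())
    # Fixed ordering so every input maps to a vector of the same length (4**k).
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts[m] / total if total else 0.0 for m in vocab]


# Made-up example reads; real input would come from a sequencing file.
features = kmer_frequencies(["ACGTAC", "GGNTA"], k=2)
```

A vector like `features` could then serve as one row of training data for a Random Forest classifier, with the species or library type as the label. That pairing is an assumption about how such features are typically used, not a description of the published pipeline.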

Funder

National Institute of General Medical Sciences (NIH)

Cancer Prevention and Research Institute of Texas (CPRIT)

Publisher

Oxford University Press (OUP)

Subject

Computational Mathematics, Computational Theory and Mathematics, Computer Science Applications, Molecular Biology, Biochemistry, Statistics and Probability

Link

https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btac050/42516088/btac050.pdf

References (20 articles)

1. Babraham Bioinformatics—FastQC a Quality Control Tool for High Throughput Sequence Data;Andrews,2010

2. Automated detection of records in biological sequence databases that are inconsistent with the literature;Bouadjenek;J. Biomed. Inform,2017

3. Near-optimal probabilistic RNA-seq quantification;Bray;Nat. Biotechnol,2016

4. KrakenUniq: confident and fast metagenomics classification using unique k-mer counts;Breitwieser;Genome Biol,2018

5. Errors in genome annotation;Brenner;Trends Genet,1999