KIMI: Knockoff Inference for Motif Identification from molecular sequences with controlled false discovery rate

Published: 2020-10-29
Issue: 6
Volume: 37
Page: 759-766
ISSN: 1367-4803
Container-title: Bioinformatics
Language: en

Author:

Bai Xin¹, Ren Jie¹, Fan Yingying², Sun Fengzhu¹

Affiliation:

1. Quantitative and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA

2. Data Sciences and Operations Department, Marshall School of Business, University of Southern California, Los Angeles, CA 90089, USA

Abstract

Motivation: The rapid development of sequencing technologies has enabled us to generate a large number of metagenomic reads from genetic materials in microbial communities, making it possible to gain deep insights into understanding the differences between the genetic materials of different groups of microorganisms, such as bacteria, viruses, plasmids, etc. Computational methods based on k-mer frequencies have been shown to be highly effective for classifying metagenomic sequencing reads into different groups. However, such methods usually use all the k-mers as features for prediction without selecting relevant k-mers for the different groups of sequences, i.e. unique nucleotide patterns containing biological significance.

Results: To select k-mers for distinguishing different groups of sequences with guaranteed false discovery rate (FDR) control, we develop KIMI, a general framework based on model-X Knockoffs regarded as the state-of-the-art statistical method for FDR control, for sequence motif discovery with arbitrary target FDR level, such that reproducibility can be theoretically guaranteed. KIMI is shown through simulation studies to be effective in simultaneously controlling FDR and yielding high power, outperforming the broadly used Benjamini–Hochberg procedure and the q-value method for FDR control. To illustrate the usefulness of KIMI in analyzing real datasets, we take the viral motif discovery problem as an example and implement KIMI on a real dataset consisting of viral and bacterial contigs. We show that the accuracy of predicting viral and bacterial contigs can be increased by training the prediction model only on relevant k-mers selected by KIMI.

Availability and implementation: Our implementation of KIMI is available at https://github.com/xinbaiusc/KIMI.

Supplementary information: Supplementary data are available at Bioinformatics online.
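The two ingredients named in the abstract, k-mer frequency features and knockoff-based selection at a target FDR level, can be illustrated with a minimal Python sketch. This is not the KIMI implementation: the function names `kmer_frequencies` and `knockoff_plus_select` are hypothetical, the selection rule shown is the generic knockoff+ threshold from the knockoff filter literature, and the statistics `W` are assumed to have been computed elsewhere (for example, as a difference in feature importance between each k-mer and its knockoff copy).

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k):
    """Normalized frequency of every k-mer over the ACGT alphabet.

    Windows containing other characters (e.g. N) are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


def knockoff_plus_select(W, q):
    """Indices selected by the knockoff+ threshold at target FDR q.

    W[j] is an (assumed, precomputed) knockoff statistic for feature j:
    large positive values favour the original feature over its knockoff.
    """
    # Candidate thresholds are the nonzero magnitudes of W, in increasing order.
    for t in sorted({abs(w) for w in W if w != 0}):
        negatives = sum(1 for w in W if w <= -t)
        positives = sum(1 for w in W if w >= t)
        # Conservative estimate of the false discovery proportion at t.
        if (1 + negatives) / max(1, positives) <= q:
            return [j for j, w in enumerate(W) if w >= t]
    return []  # no threshold meets the target level


if __name__ == "__main__":
    freqs = kmer_frequencies("ACGTACGT", 2)
    print(sorted(m for m, f in freqs.items() if f > 0))
    print(knockoff_plus_select([5, 4, 3, 2, -1, 0.5], q=0.3))
```

The threshold is the smallest magnitude at which the estimated false discovery proportion falls to or below `q`; a lower target therefore selects fewer k-mers, and possibly none.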

Funder

US National Institutes of Health

Publisher

Oxford University Press (OUP)

Subject

Computational Mathematics, Computational Theory and Mathematics, Computer Science Applications, Molecular Biology, Biochemistry, Statistics and Probability

Link

http://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btaa912/35064760/btaa912.pdf

Cited by 2 articles.

1. DeepLINK: Deep learning inference using knockoffs with applications to genomics;P NATL ACAD SCI USA;2021

2. DeepLINK: Deep learning inference using knockoffs with applications to genomics;Proceedings of the National Academy of Sciences;2021-09-03