Author:
Mori Koichi, Ozaki Haruka, Fukunaga Tsukasa
Abstract
Sequence motifs play essential roles in intermolecular interactions such as DNA-protein interactions. The discovery of novel sequence motifs is therefore crucial for revealing gene functions. Various bioinformatics tools have been developed for finding sequence motifs, but until now no software has been based on statistical hypothesis testing with statistically sound multiple testing correction. Existing software therefore cannot control the type-1 error rate. The reason is that, in the sequence motif discovery problem, conventional multiple testing correction methods are overly strict and yield very low statistical power. We developed MotiMul, which comprehensively finds significant sequence motifs using statistically sound multiple testing correction. Our key idea is the application of Tarone’s correction, which improves the statistical power of the hypothesis test by ignoring hypotheses that can never become statistically significant. For the efficient enumeration of significant sequence motifs, we integrated a variant of the PrefixSpan algorithm with Tarone’s correction. Simulation and empirical dataset analyses showed that MotiMul is a powerful method for finding biologically meaningful sequence motifs. The source code of MotiMul is freely available at https://github.com/ko-ichimo-ri/MotiMul.
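The idea behind Tarone’s correction can be illustrated with a minimal Python sketch. It is not MotiMul's implementation: it assumes a one-sided Fisher's exact test on a positive versus negative sequence set (the abstract does not name the test), and the function names `min_attainable_p` and `tarone_threshold` are hypothetical. A motif occurring in only a few sequences has a minimum attainable p-value that can never fall below the corrected threshold, so it is excluded from the correction factor.

```python
from math import comb


def min_attainable_p(support, n_pos, n_total):
    """Smallest p-value a one-sided Fisher's exact test can reach for a motif
    found in `support` of `n_total` sequences, `n_pos` of which are positive.
    It is the hypergeometric probability of the most extreme table, where as
    many occurrences as possible fall in the positive set (assumed test)."""
    a = min(support, n_pos)
    return comb(n_pos, a) * comb(n_total - n_pos, support - a) / comb(n_total, support)


def tarone_threshold(supports, n_pos, n_total, alpha=0.05):
    """Return (k, alpha / k), where k is the smallest integer such that at most
    k hypotheses have a minimum attainable p-value <= alpha / k."""
    min_ps = [min_attainable_p(x, n_pos, n_total) for x in supports]
    k = 1
    while True:
        # Hypotheses that could still reach significance at level alpha / k.
        testable = sum(1 for p in min_ps if p <= alpha / k)
        if testable <= k:
            return k, alpha / k
        k += 1


# Toy example: supports of 8 candidate motifs in 20 sequences (10 positive).
supports = [1, 2, 5, 8, 10, 1, 1, 2]
k, threshold = tarone_threshold(supports, n_pos=10, n_total=20)
print(f"Bonferroni factor: {len(supports)}, Tarone factor: {k}, threshold: {threshold:.4f}")
```

In this toy example the correction factor is smaller than the Bonferroni factor because the low-support motifs are untestable, which is the source of the power gain described above.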
Publisher
Cold Spring Harbor Laboratory