Abstract
High-throughput sequencing-based assays measure different biochemical activities pertaining to gene regulation, genome-wide. These activities include protein-DNA binding, enhancer activity, open chromatin, and more. A major goal is to understand the underlying sequence components, or motifs, that can explain the measured activity. It is usually not one motif, but a combination of motifs bound by cooperatively acting proteins, that confers activity to such regions. Furthermore, regions that share a single type of activity can still be diverse, governed by different combinations of proteins/motifs. Current approaches do not take this combinatorial diversity into account. We present a new statistical framework, cisDiversity, which models regions as diverse modules characterized by combinations of motifs, while simultaneously learning the motifs themselves. We show that ChIP-seq data for the CTCF protein in fly contains diverse sequence structures, with most direct CTCF-binding sites situated far from promoters, giving insights into its co-factors and potential role in looping. Human CTCF-bound regions, on the other hand, have a different architecture. Because cisDiversity does not rely on knowledge of motifs, modules, cell-type, or organism, it is general enough to be applied to regions reported by most high-throughput assays. Indeed, enhancer predictions resulting from different assays (GRO-cap, STARR-seq, and those measuring chromatin structure) show distinct modules and combinations of TF binding sites, some specific to the assay. No module occurs universally across all enhancer assays. Finally, analysis of accessible chromatin suggests that regions open in one cell-state encode information about future states, with certain modules staying open and others closing down later. The code is freely available at https://github.com/NarlikarLab/cisDIVERSITY.
Publisher
Cold Spring Harbor Laboratory