Abstract
ABSTRACTMouse is probably the most important model organism to study mammal biology and human diseases. A better understanding of the mouse genome will help understand the human genome, biology and diseases. However, despite the recent progress, characterization of the regulatory sequences in the mouse genome is still far from complete, limiting its use to understand the regulatory sequences in the human genome. Here, by integrating binding peaks in 9,060 transcription factor (TF) ChIP-seq datasets that cover 79.9% of the mouse mappable genome using an efficient pipeline, we were able to partition these binding peak-covered genome regions into a cis-regulatory module (CRM) candidate (CRMC) set and a non-CRMC sets. The CRMCs contain 912,197 putative CRMs and 38,554,729 TF binding sites (TFBSs) islands, covering 55.5% and 24.4% of the mappable genome, respectively. The CRMCs tend to be under strongly evolutionary constraints, indicating that they are likely cis-regulatory; while the non-CRMCs are largely selectively neutral, indicating that they are unlikely cis-regulatory. Based on evolutionary profiles of the genome positions, we further estimated that 63.8% and 27.4% of the mouse genome might code for CRMs and TFBSs, respectively. Validation using experimental data suggests that at least most of the CRMCs are authentic. Thus, this unprecedentedly comprehensive map of CRMs and TFBSs can be a good resource to guide experimental studies of regulatory genomes in mice and humans.
Publisher
Cold Spring Harbor Laboratory