Abstract
Predicting cis-regulatory modules (CRMs) in a genome and predicting their functional states in various cell/tissue types of the organism are two related, challenging computational tasks. Most current methods attempt to achieve both simultaneously using epigenetic data. Though conceptually attractive, they suffer from high false discovery rates and limited applicability. To fill these gaps, we proposed a two-step strategy: first predict a map of CRMs in the genome, and then predict the functional states of these CRMs in various cell/tissue types of the organism. We have recently developed an algorithm for accurately predicting CRMs in a genome by integrating numerous transcription factor ChIP-seq datasets. Here, we showed that data for only three or four epigenetic marks in a cell/tissue type were sufficient for a machine-learning model to accurately predict the functional states of all CRMs. Our predictions are substantially more accurate than the best achieved so far. Interestingly, a model trained on several cell/tissue types of a mammal can accurately predict the functional states of CRMs in other cell/tissue types of the same mammal, as well as in various cell/tissue types of a different mammal. Therefore, the epigenetic code that defines the functional states of CRMs in various cell/tissue types is universal, at least in mammals. Moreover, we found that tens to hundreds of thousands of CRMs were active in a given human or mouse cell/tissue type, and up to 99.98% of them were reutilized in different cell/tissue types, while as few as 0.02% of them were unique to a single cell/tissue type and might define that cell/tissue type.
Publisher
Cold Spring Harbor Laboratory