iEnhancer-ELM: improve enhancer identification by extracting position-related multiscale contextual information based on enhancer language models

Author:

Li Jiahao (1), Wu Zhourun (1), Lin Wenhao (1), Luo Jiawei (1), Zhang Jun (1), Chen Qingcai (1, 2), Chen Junjie (1)

Affiliation:

1. School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China

2. Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China

Abstract

Motivation: Enhancers are important cis-regulatory elements that regulate a wide range of biological functions and enhance the transcription of target genes. Although many feature extraction methods have been proposed to improve the performance of enhancer identification, they cannot learn position-related multiscale contextual information from raw DNA sequences.

Results: In this article, we propose a novel enhancer identification method (iEnhancer-ELM) based on BERT-like enhancer language models. iEnhancer-ELM tokenizes DNA sequences with multi-scale k-mers and extracts contextual information of different-scale k-mers related to their positions via a multi-head attention mechanism. We first evaluate the performance of different-scale k-mers, then ensemble them to improve the performance of enhancer identification. The experimental results on two popular benchmark datasets show that our model outperforms state-of-the-art methods. We further illustrate the interpretability of iEnhancer-ELM. In a case study, we discover 30 enhancer motifs via a 3-mer-based model, of which 12 are verified by STREME and JASPAR, demonstrating that our model has the potential to unveil the biological mechanisms of enhancers.

Availability and implementation: The models and associated code are available at https://github.com/chen-bioinfo/iEnhancer-ELM

Supplementary information: Supplementary data are available at Bioinformatics Advances online.
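The multi-scale k-mer tokenization mentioned in the Results can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `kmer_tokenize` and `multiscale_tokenize`, the overlapping stride-1 windows, and the default scales (3, 4, 5, 6) are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import Dict, List, Sequence


def kmer_tokenize(sequence: str, k: int) -> List[str]:
    """Split a DNA sequence into overlapping k-mers.

    Assumption: a sliding window with stride 1, which is a common choice
    for BERT-like DNA models; the abstract does not state the stride.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.upper()
    # A sequence of length n yields n - k + 1 overlapping k-mers
    # (and none at all when k is larger than n).
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def multiscale_tokenize(
    sequence: str, scales: Sequence[int] = (3, 4, 5, 6)
) -> Dict[int, List[str]]:
    """Tokenize one sequence at several k-mer scales.

    The default scales are hypothetical; each scale would feed its own
    language model before the per-scale predictions are ensembled.
    """
    return {k: kmer_tokenize(sequence, k) for k in scales}


if __name__ == "__main__":
    for k, kmers in multiscale_tokenize("ACGTAC").items():
        print(k, kmers)
```

Each token keeps its index in the list, so a position-aware model such as a BERT-like encoder can relate a k-mer to where it occurs in the sequence.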

Funder

Natural Science Foundation of China

Educational Commission of Guangdong Province of China

Publisher

Oxford University Press (OUP)

Subject

Cell Biology, Developmental Biology, Embryology, Anatomy


Cited by 3 articles.