KMAP: Kmer Manifold Approximation and Projection for visualizing DNA sequences

Author:

Fu Chengbo, Niskanen Einari A., Wei Gong-Hong, Yang Zhirong, Sanvicente-García Marta, Güell Marc, Cheng Lu

Abstract

Identifying and illustrating patterns in DNA sequences is a crucial task in various biological data analyses. In this task, patterns are often represented by sets of kmers, the fundamental building blocks of DNA sequences. To visually unveil these patterns, we could project each kmer onto a point in two-dimensional (2D) space. However, this projection poses challenges due to the high-dimensional nature of kmers and their unique mathematical properties. Here, we established a mathematical system to address the peculiarities of the kmer manifold. Leveraging this kmer manifold theory, we developed a statistical method named KMAP for detecting kmer patterns and visualizing them in 2D space. We applied KMAP to three distinct datasets to showcase its utility. KMAP achieved performance comparable to the classical method MEME, with approximately 90% similarity in motif discovery from HT-SELEX data. In the analysis of H3K27ac ChIP-seq data from Ewing Sarcoma (EWS), we found that BACH1, OTX2 and ERG1 might affect EWS prognosis by binding to promoter and enhancer regions across the genome. We also found that FLI1 bound to the enhancer regions after ETV6 degradation, which showed competitive binding between ETV6 and FLI1. Moreover, KMAP identified four prevalent patterns in gene editing data of the AAVS1 locus, aligning with findings reported in the literature. These applications underscore that KMAP could be a valuable tool across various biological contexts. KMAP is freely available at: https://github.com/chengl7-lab/kmap.
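
To make the notion of kmers as building blocks concrete, the following is a minimal Python sketch that extracts kmers from a few sequences and computes pairwise Hamming distances between them. It is not KMAP's algorithm and does not use the KMAP package; the function names `count_kmers` and `hamming`, the toy reads, and the choice of Hamming distance are all assumptions made for illustration. A distance table of this kind is the sort of input that a general-purpose 2D embedding method could take.

```python
from collections import Counter
from itertools import combinations


def count_kmers(sequences, k):
    """Count every length-k substring (kmer) across a list of DNA sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # Slide a window of width k along the sequence.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def hamming(a, b):
    """Number of positions at which two equal-length kmers differ."""
    if len(a) != len(b):
        raise ValueError("kmers must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Toy input, made up for illustration only (hypothetical reads).
reads = ["ACGTAC", "CGTACG"]
kmer_counts = count_kmers(reads, k=4)

# Pairwise distances between the distinct kmers that were observed.
kmers = sorted(kmer_counts)
distances = {(a, b): hamming(a, b) for a, b in combinations(kmers, 2)}
```

Even for modest k the number of possible kmers grows as 4^k, which is one reason a direct 2D projection is hard, as the abstract notes.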

Publisher

Cold Spring Harbor Laboratory

