Machine learning to classify mutational hotspots from molecular dynamic simulations

Author:

Davies James, Menzies Georgina

Abstract

Benzo[a]pyrene, a notorious DNA-damaging carcinogen, belongs to the family of polycyclic aromatic hydrocarbons commonly found in tobacco smoke. Surprisingly, the nucleotide excision repair (NER) machinery is inefficient at recognising specific bulky DNA adducts, including Benzo[a]pyrene Diol-Epoxide (BPDE), a Benzo[a]pyrene metabolite. While sequence context is emerging as the leading factor linking the inadequate NER response to BPDE adducts, the precise structural attributes governing these disparities remain poorly understood. We therefore combined molecular dynamics and machine learning to conduct a comprehensive assessment of the helical distortion caused by BPDE-Guanine adducts in multiple gene contexts. Specifically, we implemented a dual approach involving a random forest classification-based analysis and subsequent feature selection to identify precise topological features that may distinguish adduct sites of variable repair capacity. Our models were trained using helical data extracted from duplexes representing both BPDE hotspot and non-hotspot sites within the TP53 gene, then applied to sites within the TP53, cII, and lacZ genes.

We show that our optimised model consistently achieved exceptional performance, with accuracy, precision, and F1 scores exceeding 91%. Our feature selection approach revealed that discernible variance in regional base pair rotation played a pivotal role in informing the decisions of our model. Notably, these disparities were highly conserved among TP53 and lacZ duplexes and appeared to be influenced by the regional GC content. As such, our findings suggest that there are indeed conserved topological features distinguishing hotspot and non-hotspot sites, highlighting regional GC content as a potential biomarker for mutation.

Author Summary

Although much is known about DNA repair processes, we still lack some fundamental understanding of how DNA sequence relates to mutation rate, specifically why some sequences mutate at a higher rate, or are repaired less, than others. We believe that by combining Molecular Simulation and Machine Learning (ML) we can measure which structural features are present in sequences that mutate at higher rates, both in a cancer gene and in lab-based test assays frequently used to investigate toxicology.

Here we have run Molecular Dynamics on five sets of DNA sequences, with and without a carcinogen found in cigarette smoke, to allow us to study the mutation event that would need to be repaired. We have measured their helical and base stacking properties. Using this model, we have used ML to successfully differentiate between low and high mutating sequences, allowing us to begin to elucidate the structural features these groups have in common.

We believe this method could have wide-reaching uses. It could be applied to any gene context and mutation event, and knowledge of the structural features that are best repaired gives us insight into the biophysics of DNA repair, adding knowledge to the drug design pipeline.
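The abstract reports model performance as accuracy, precision, and F1. As a minimal sketch of how these three metrics relate for a two-class problem such as hotspot versus non-hotspot, the following plain-Python function computes them from true and predicted labels. The function name `classification_metrics`, the label encoding (1 = hotspot, 0 = non-hotspot), and the example labels are illustrative assumptions, not the authors' code or data.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for binary labels.

    `positive` is the label treated as the class of interest
    (assumed here to be 1 = hotspot; this encoding is hypothetical).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels for illustration only (not results from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Because F1 combines precision and recall, a model can only score highly on all three reported metrics if it rarely mislabels sites in either direction.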

Publisher

Cold Spring Harbor Laboratory
