Abstract
Benzo[a]pyrene, a notorious DNA-damaging carcinogen, belongs to the family of polycyclic aromatic hydrocarbons commonly found in tobacco smoke. Surprisingly, nucleotide excision repair (NER) machinery exhibits inefficiency in recognising specific bulky DNA adducts including Benzo[a]pyrene Diol-Epoxide (BPDE), a Benzo[a]pyrene metabolite. While sequence context is emerging as the leading factor linking the inadequate NER response to BPDE adducts, the precise structural attributes governing these disparities remain inadequately understood. We therefore combined molecular dynamics and machine learning to conduct a comprehensive assessment of helical distortion caused by BPDE-Guanine adducts in multiple gene contexts. Specifically, we implemented a dual approach involving a random forest classification-based analysis and subsequent feature selection to identify precise topological features that may distinguish adduct sites of variable repair capacity. Our models were trained using helical data extracted from duplexes representing both BPDE hotspot and non-hotspot sites within the TP53 gene, then applied to sites within the TP53, cII, and lacZ genes. We show that our optimised model consistently achieved exceptional performance, with accuracy, precision, and F1 scores exceeding 91%. Our feature selection approach uncovered that discernible variance in regional base pair rotation played a pivotal role in informing the decisions of our model. Notably, these disparities were highly conserved among TP53 and lacZ duplexes and appeared to be influenced by the regional GC content.
As such, our findings suggest that there are indeed conserved topological features distinguishing hotspot and non-hotspot sites, highlighting regional GC content as a potential biomarker for mutation.
Author Summary
Although much is known about DNA repair processes, we still lack a fundamental understanding of how DNA sequence relates to mutation rate, specifically why some sequences mutate at a higher rate, or are repaired less, than others. We believe that by combining Molecular Simulation and Machine Learning (ML) we can measure which structural features are present in sequences that mutate at higher rates, both in a cancer gene and in lab-based test assays frequently used to investigate toxicology.
Here we have run Molecular Dynamics on five sets of DNA sequences, with and without a carcinogen found in cigarette smoke, to allow us to study the mutation event that would need to be repaired. We have measured their helical and base stacking properties. We have used ML to successfully differentiate between low- and high-mutating sequences, and this model allows us to begin to elucidate the structural features these groups have in common.
We believe this method could have wide-reaching uses: it could be applied to any gene context and mutation event. Indeed, knowledge of the structural features that are best repaired gives us insight into the biophysics of DNA repair, adding knowledge to the drug design pipeline.
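The abstract reports accuracy, precision, and F1 scores for a classifier that separates hotspot from non-hotspot sites. As a reminder of how those three scores are defined, the following is a minimal sketch in plain Python. The function name, the label strings, and the toy labels are hypothetical and chosen only for illustration; they are not the authors' code or data.

```python
def classification_metrics(y_true, y_pred, positive="hotspot"):
    """Compute accuracy, precision, recall and F1 for a binary labelling.

    `positive` names the class treated as the positive one
    (an illustrative choice, not taken from the paper).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    # Precision: fraction of predicted positives that are truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of true positives that were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels, invented purely to exercise the function.
toy_true = ["hotspot", "hotspot", "non-hotspot", "non-hotspot"]
toy_pred = ["hotspot", "non-hotspot", "non-hotspot", "non-hotspot"]
toy_scores = classification_metrics(toy_true, toy_pred)
```

Because F1 is the harmonic mean of precision and recall, a high F1 score requires both to be high.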
Publisher
Cold Spring Harbor Laboratory