Affiliation:
1. Oak Ridge National Laboratory, Oak Ridge, TN, USA
Abstract
The COVID-19 pandemic has highlighted the need for computational tools that automate and accelerate drug design for novel protein targets. We leverage deep learning language models to generate and score drug candidates based on predicted protein binding affinity. We pre-trained a deep learning language model (BERT) on ∼9.6 billion molecules and achieved peak performance of 603 petaflops in mixed precision. Our work reduces pre-training time from days to hours compared to previous efforts with this architecture, while also increasing the dataset size by nearly an order of magnitude. For scoring, we fine-tuned the language model on an assembled set of thousands of protein targets with binding affinity data and searched for inhibitors of two specific protein targets, the SARS-CoV-2 main protease (Mpro) and papain-like protease (PLpro). We used a genetic algorithm, driven by the generation and scoring capabilities of the language model, to search for optimal candidates. Our generalizable models accelerate the identification of inhibitors for emerging therapeutic targets.
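The following is a minimal, hypothetical sketch (not the authors' released code) of the generate-and-score loop described in the abstract: a masked SMILES language model proposes candidate variants, a fine-tuned affinity model scores each candidate against a protein target, and a simple genetic algorithm keeps the best scorers. The checkpoint names "chem-bert-smiles" and "chem-bert-affinity" are placeholders, and the pairing of protein sequence with SMILES in the scorer input is an assumption about the fine-tuning setup.

```python
# Hypothetical sketch of the generation + scoring genetic algorithm loop.
# Checkpoint names are placeholders; they are not published model identifiers.
import random
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

fill_mask = pipeline("fill-mask", model="chem-bert-smiles")       # SMILES generator (assumed checkpoint)
tok = AutoTokenizer.from_pretrained("chem-bert-affinity")         # scorer tokenizer (assumed checkpoint)
scorer = AutoModelForSequenceClassification.from_pretrained(
    "chem-bert-affinity", num_labels=1)                           # regression head -> predicted affinity

def mutate(smiles: str) -> str:
    """Mask one token of a SMILES string and let the language model fill it in."""
    tokens = fill_mask.tokenizer.tokenize(smiles)
    if not tokens:
        return smiles
    i = random.randrange(len(tokens))
    tokens[i] = fill_mask.tokenizer.mask_token
    masked = fill_mask.tokenizer.convert_tokens_to_string(tokens)
    return fill_mask(masked, top_k=1)[0]["sequence"]

def score(smiles: str, target_seq: str) -> float:
    """Predict binding affinity for a (protein sequence, SMILES) pair."""
    inputs = tok(target_seq, smiles, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return scorer(**inputs).logits.item()

def evolve(population, target_seq, generations=10, keep=16):
    """Alternate generation (mutation) and selection (affinity ranking)."""
    for _ in range(generations):
        population = population + [mutate(s) for s in population]  # generation step
        population.sort(key=lambda s: score(s, target_seq), reverse=True)
        population = population[:keep]                             # selection step
    return population
```

In practice the selection criterion could combine predicted affinity with other objectives (validity, synthesizability), but the core loop above captures the generate-then-score structure the abstract describes.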
Funder
U.S. Department of Energy
Subject
Hardware and Architecture, Theoretical Computer Science, Software