Abstract
Large language models (LLMs) have greatly enhanced our ability to understand biology and chemistry. Yet, relatively few robust methods have been reported for structure-based drug discovery. Highly precise biomolecule-ligand interaction datasets are urgently needed in particular for LLMs, that require extensive training data. We present MISATO, the first dataset that combines quantum mechanics properties of small molecules and associated molecular dynamics simulations of about 20000 experimental protein-ligand complexes. Starting from the PDBbind dataset, semi-empirical quantum mechanics was used to systematically refine these structures. The largest collection to date of molecular dynamics traces of protein-ligand complexes in explicit water are included, accumulating to 170 μs. We give ML baseline models and simple Python data loaders, and aim to foster a thriving community around MISATO (https://github.com/t7morgen/misato-dataset). An easy entry point for ML experts is provided without the need of deep domain expertise to enable the next generation of drug discovery AI models.
Publisher
Cold Spring Harbor Laboratory
Cited by
7 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献