Abstract
Machine learning (ML) has a rich history in structural bioinformatics, and modern approaches, such as deep learning, are revolutionizing our knowledge of the subtle relationships between biomolecular sequence, structure, function, dynamics and evolution. As with any advance that rests upon statistical learning approaches, the recent progress in biomolecular sciences is enabled by the availability of vast volumes of sufficiently variable data. To be of utility, such datasets must be well-structured, machine-readable, intelligible and manipulable. These and related requirements pose challenges that become especially acute at the computational scales typical in ML. Furthermore, in structural bioinformatics such data generally relate to protein three-dimensional (3D) structures, which are inherently far more complex than sequence-based data. A significant and recurring challenge concerns the creation of large, high-quality, openly accessible datasets that can be used for specific training and benchmarking tasks in ML pipelines for predictive modeling projects, along with reproducible splits for training and testing.
Here, we report Prop3D, a protein biophysical and evolutionary featurization and data-processing pipeline that we have developed and deployed—both in the cloud and on local HPC resources—in order to systematically and reproducibly create comprehensive datasets, using the Highly Scalable Data Service (HSDS). Prop3D and its associated ‘Prop3D-20sf’ dataset can be of broader utility, as a community-wide resource, for other structure-related workflows, particularly for tasks that arise at the intersection of deep learning and classical structural bioinformatics.
Author Summary
We have developed a ‘Prop3D’ platform that allows for the creation, sharing and reuse of atomically-resolved physicochemical properties for any library of protein domains (preexisting or user-defined); in addition, we provide an associated ‘Prop3D-20sf’ protein dataset, obtained by applying this approach to CATH. This resource may be of use to the broader community in at least three ways. Firstly, the Prop3D workflow code can be customized and deployed on various computational platforms (cloud-based or local high-performance compute clusters), with scalability achieved largely by saving the results to distributed HDF5 files via the Highly Scalable Data Service (HSDS). Secondly, the linked Prop3D-20sf dataset provides a hand-crafted, already-featurized dataset of protein structures for 20 highly populated CATH families; importantly, provision of this pre-computed resource can aid the more efficient development (and reproducible deployment) of ML pipelines. Thirdly, Prop3D-20sf’s construction explicitly takes into account (in creating datasets and data-splits) the enigma of ‘data leakage’, stemming from the evolutionary relationships between proteins. The datasets that we provide (using HSDS) can be freely accessed via a standard representational state transfer (REST) application programming interface (API), along with accompanying Python wrappers for NumPy and the popular ML framework PyTorch.
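To illustrate the access pattern described above, the sketch below uses h5py, whose dataset-slicing API the HSDS Python client (h5pyd) mirrors; against a remote HSDS domain one would instead open `h5pyd.File(domain, "r")` over the REST endpoint. The file path, group name and feature layout here are hypothetical toy values, not the actual Prop3D-20sf schema.

```python
# Minimal sketch of HDF5-style dataset access (hypothetical names).
# h5pyd -- the HSDS client used by Prop3D -- exposes the same slicing
# idiom over REST; a local file keeps this example self-contained.
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "toy_prop3d.h5")

# Write a toy per-atom feature matrix (rows = atoms, cols = features).
with h5py.File(path, "w") as f:
    f.create_dataset(
        "domain_0001/atom_features",
        data=np.arange(12, dtype=np.float32).reshape(4, 3),
    )

# Slicing a dataset returns a NumPy array without loading the whole
# file; with h5pyd the same expression issues ranged REST requests.
with h5py.File(path, "r") as f:
    feats = f["domain_0001/atom_features"][:2, :]  # first two atoms

print(feats.shape)  # (2, 3)
```

Because h5pyd tracks the h5py API, code developed locally in this style can typically be pointed at an HSDS-hosted domain with only the `File` constructor changed.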
Publisher
Cold Spring Harbor Laboratory