Generative pretrained autoregressive transformer graph neural network applied to the analysis and discovery of novel proteins

Author:

Buehler Markus J.12ORCID

Affiliation:

1. Laboratory for Atomistic and Molecular Mechanics (LAMM), Massachusetts Institute of Technology 1 , 77 Massachusetts Ave., Cambridge, Massachusetts 02139, USA

2. Center for Computational Science and Engineering, Schwarzman College of Computing, Massachusetts Institute of Technology 2 , 77 Massachusetts Ave., Cambridge, Massachusetts 02139, USA

Abstract

We report a flexible language-model-based deep learning strategy, applied here to solve complex forward and inverse problems in protein modeling, based on an attention neural network that integrates transformer and graph convolutional architectures in a causal multi-headed graph mechanism, to realize a generative pretrained model. The model is applied to predict the secondary structure content (per-residue level and overall content), protein solubility, and sequencing tasks. Further trained on inverse tasks, the model is rendered capable of designing proteins with these properties as target features. The model is formulated as a general framework, completely prompt-based, and can be adapted for a variety of downstream tasks. We find that adding additional tasks yields emergent synergies that the model exploits in improving overall performance, beyond what would be possible by training a model on each dataset alone. Case studies are presented to validate the method, yielding protein designs specifically focused on structural materials, but also exploring the applicability in the design of soluble, antimicrobial biomaterials. While our model is trained to ultimately perform eight distinct tasks, with available datasets, it can be extended to solve additional problems. In a broader sense, this study illustrates a form of multiscale modeling that relates a set of ultimate building blocks (here, byte-level utf8 characters that define the nature of the physical system at hand) to complex output. This materiomic scheme captures complex emergent relationships between universal building block and resulting properties, via a synergizing learning capacity, to express a set of potentialities embedded in the knowledge used in training via the interplay of universality and diversity. Significance statement: Predicting the properties of materials based on a flexible description of their structure, environment, or process, is a long-standing challenge in multiscale modeling. Our MaterioFormer language model, trained to solve forward and inverse tasks, incorporates a deep learning capacity through attention and graph strategies to yield a multimodal approach to model and design materials. Since our model is prompt-based and information is encoded consistently via byte-level utf8 tokenization, it can process diverse modalities of information, such as sequence data, description of tasks, and numbers, and offers a flexible workflow that integrates human intelligence and artificial intelligence. Autoregressive training, using pre-training against a large unlabeled dataset, allows for straightforward adjustment of specific objectives.

Funder

MIT-IBM Watson AI Lab

ARO

ONR

USDA

Publisher

AIP Publishing

Subject

General Physics and Astronomy

同舟云学术

1.学者识别学者识别

2.学术分析学术分析

3.人才评估人才评估

"同舟云学术"是以全球学者为主线,采集、加工和组织学术论文而形成的新型学术文献查询和分析系统,可以对全球学者进行文献检索和人才价值评估。用户可以通过关注某些学科领域的顶尖人物而持续追踪该领域的学科进展和研究前沿。经过近期的数据扩容,当前同舟云学术共收录了国内外主流学术期刊6万余种,收集的期刊论文及会议论文总量共计约1.5亿篇,并以每天添加12000余篇中外论文的速度递增。我们也可以为用户提供个性化、定制化的学者数据。欢迎来电咨询!咨询电话:010-8811{复制后删除}0370

www.globalauthorid.com

TOP

Copyright © 2019-2024 北京同舟云网络信息技术有限公司
京公网安备11010802033243号  京ICP备18003416号-3