Ankh ☥: Optimized Protein Language Model Unlocks General-Purpose Modelling

Published: 2023-01-18

Author:

Elnaggar Ahmed, Essam Hazem, Salah-Eldin Wafaa, Moustafa Walid, Elkerdawy Mohamed, Rochereau Charlotte, Rost Burkhard

Abstract

As opposed to scaling up protein language models (PLMs), we seek to improve performance via protein-specific optimization. Although the proportionality between language model size and the richness of its learned representations is validated, we prioritize accessibility and pursue a path of data-efficient, cost-reduced, and knowledge-guided optimization. Through over twenty experiments spanning masking, architecture, and pre-training data, we derive insights from protein-specific experimentation into building a model that interprets the language of life optimally. We present Ankh, the first general-purpose PLM trained on Google’s TPU-v4, surpassing state-of-the-art performance with fewer parameters (<10% for pre-training, <7% for inference, and <30% for the embedding dimension). We provide a representative range of structure and function benchmarks where Ankh excels. We further provide a protein variant generation analysis on High-N and One-N input data scales, where Ankh succeeds in learning protein evolutionary conservation-mutation trends and introducing functional diversity while retaining key structural-functional characteristics. We dedicate our work to promoting accessibility to research innovation via attainable resources.

Publisher

Cold Spring Harbor Laboratory

References: 60 articles.

1. BERTology meets biology: interpreting attention in protein language models; arXiv preprint, 2020

2. Rao, Roshan and Meier, Joshua and Sercu, Tom and Ovchinnikov, Sergey and Rives, Alexander. Transformer protein language models are unsupervised structure learners. bioRxiv, 2020.

3. Prot-Trans: towards cracking the language of Life’s code through self-supervised deep learning and high performance computing; others; arXiv preprint, 2020

4. Rives, Alexander and Meier, Joshua and Sercu, Tom and Goyal, Siddharth and Lin, Zeming and Liu, Jason and Guo, Demi and Ott, Myle and Zitnick, C Lawrence and Ma, Jerry and others. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences, vol. 118, no. 15, 2021.

5. Heinzinger, Michael and Elnaggar, Ahmed and Wang, Yu and Dallago, Christian and Nechaev, Dmitrii and Matthes, Florian and Rost, Burkhard. Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinformatics, vol. 20, no. 1, 2019.

Cited by 35 articles.

1. Assessing the role of evolutionary information for enhancing protein language model embeddings; Scientific Reports; 2024-09-05

2. AI-accelerated therapeutic antibody development: practical insights; Frontiers in Drug Discovery; 2024-09-03

3. Fine-tuning protein language models boosts predictions across diverse tasks; Nature Communications; 2024-08-28

4. MuLAN: Mutation-driven Light Attention Networks for investigating protein-protein interactions from sequences; 2024-08-26

5. TooT-PLM-P2S: Incorporating Secondary Structure Information into Protein Language Models; 2024-08-13