Benchmarking text-integrated protein language model embeddings and embedding fusion on diverse downstream tasks

Published: 2024-08-26

Author:

Ko Young Su, Parkinson Jonathan, Wang Wei

Abstract

Protein language models (pLMs) have traditionally been trained in an unsupervised manner on large protein sequence databases, using an autoregressive or masked-language modeling paradigm. Recent methods have attempted to enhance pLMs by integrating additional information in the form of text; these models are referred to as “text+protein” language models (tpLMs). We evaluate and compare six tpLMs (OntoProtein, ProteinDT, ProtST, ProteinCLIP, ProTrek, and ESM3) against ESM2, a baseline text-free pLM, across six downstream tasks designed to assess the learned protein representations. We find that while tpLMs outperform ESM2 in five out of six benchmarks, no single tpLM is consistently the best. We therefore also investigate embedding fusion, exploring whether combinations of tpLM embeddings can improve benchmark performance by exploiting the strengths of multiple tpLMs. We find that combinations of tpLM embeddings outperform single tpLM embeddings in five out of six benchmarks, highlighting the potential of embedding fusion as a useful strategy in machine learning for proteins. To facilitate the practical application of embedding fusion, we outline a heuristic framework to efficiently identify the optimal combination of embeddings, reducing the exponential time complexity of an exhaustive combination search to a manageable linear time complexity. Using our embedding fusion framework, we achieve state-of-the-art performance on the protein-protein interaction prediction and homologous sequence recovery tasks without any specific model adjustments or hyperparameter tuning. Our experiments suggest that embedding fusion is a useful tool in the machine-learning-for-proteins toolbox. Lastly, this study highlights the potential of future research on additional strategies for maximizing the utility of pLMs.
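
The abstract does not spell out how embeddings are fused or how the linear-time search works, so the following is only a minimal sketch under stated assumptions: fusion is taken to be feature-wise concatenation, and the search is a simple "rank, then grow a prefix" heuristic chosen here for illustration. The names `fuse`, `linear_fusion_search`, and the `score` callback are hypothetical and are not taken from the paper; `score` stands in for whatever downstream benchmark metric is being maximized.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Embedding = List[List[float]]  # one feature vector (row) per protein


def fuse(embeddings: Sequence[Embedding]) -> Embedding:
    """Fuse embeddings by concatenating feature vectors protein by protein.

    Assumption: concatenation is used as the fusion operator.
    """
    return [sum((list(row) for row in rows), []) for rows in zip(*embeddings)]


def linear_fusion_search(
    embeddings: Dict[str, Embedding],
    score: Callable[[Embedding], float],
) -> Tuple[List[str], float]:
    """Hypothetical heuristic: at most 2n - 1 scoring calls for n models."""
    # Step 1: score each model's embedding on its own (n evaluations).
    single = {name: score(emb) for name, emb in embeddings.items()}
    ranked = sorted(single, key=single.get, reverse=True)

    # Step 2: add models in ranked order and keep the best-scoring prefix
    # (n - 1 further evaluations).
    best_names, best_score = ranked[:1], single[ranked[0]]
    for k in range(2, len(ranked) + 1):
        names = ranked[:k]
        s = score(fuse([embeddings[n] for n in names]))
        if s > best_score:
            best_names, best_score = names, s
    return best_names, best_score
```

With n candidate embeddings, an exhaustive search would score all 2^n - 1 non-empty subsets (63 for n = 6), whereas this sketch makes at most 2n - 1 scoring calls (11 for n = 6). Because it only considers prefixes of the ranked list, it can miss the best subset; that trade-off is inherent to any linear-time heuristic of this kind.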

Publisher

Cold Spring Harbor Laboratory
