Aggregating Residue-Level Protein Language Model Embeddings with Optimal Transport

Published: 2024-01-31

Author:

NaderiAlizadeh Navid, Singh Rohit

Abstract

Motivation: Protein language models (PLMs) have emerged as powerful approaches for mapping protein sequences into informative embeddings suitable for a range of applications. PLMs, as well as many other protein representation schemes, generate per-token (i.e., per-residue) representations, leading to variable-sized outputs based on protein length. This variability presents a challenge for protein-level prediction tasks, which require uniform-sized embeddings for consistent analysis across different proteins. Prior work has typically resorted to average pooling to summarize token-level PLM outputs. It is, however, unclear if such an aggregation operation effectively prioritizes the relevant information across token-level representations.

Results: Addressing this, we introduce a novel method utilizing sliced-Wasserstein embeddings to convert variable-length PLM outputs into fixed-length protein-level representations. Inspired by the success of optimal transport techniques in representation learning, we first conceptualize per-token PLM outputs as samples from a probabilistic distribution. We then employ sliced-Wasserstein distances to map these samples against a learnable reference set, creating a Euclidean embedding in the output space. The resulting embedding is agnostic to the length of the input and represents the entire protein. Across a range of state-of-the-art pre-trained ESM-2 PLMs, with varying model sizes, we show the superiority of our method over average pooling for protein-drug and protein-protein interaction. Our aggregation scheme is especially effective when model size is constrained, enabling smaller-scale PLMs to match or exceed the performance of average-pooled larger-scale PLMs. Since using smaller models reduces computational resource requirements, our approach not only promises more accurate inference but can also help democratize access to foundation models.

Availability and implementation: The implementation code can be found at https://github.com/navid-naderi/PLM_SWE.
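The pooling idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the repository linked above for that); it is a simplified illustration under stated assumptions: the function name `swe_pool` and its parameters are hypothetical, the reference set and slicing directions are fixed random arrays rather than learned, and random vectors stand in for per-residue ESM-2 outputs. Each token set is projected onto a few one-dimensional slices, its sorted projections are resampled at the reference set's quantile levels, and the displacement from the sorted reference projections is used as the embedding.

```python
import numpy as np

def swe_pool(tokens, reference, directions):
    """Sketch of sliced-Wasserstein-style pooling (hypothetical helper).

    tokens:     (n, d) per-residue embeddings, n varies per protein
    reference:  (m, d) reference set (fixed here; learnable in the paper)
    directions: (d, L) unit-norm slicing directions
    returns:    (m * L,) fixed-length embedding, independent of n
    """
    n, m = tokens.shape[0], reference.shape[0]
    num_slices = directions.shape[1]

    # Project onto each slice and sort: sorted values are empirical quantiles.
    tok_proj = np.sort(tokens @ directions, axis=0)      # (n, L)
    ref_proj = np.sort(reference @ directions, axis=0)   # (m, L)

    # Midpoint quantile levels for the token set and the reference set.
    tok_q = (np.arange(n) + 0.5) / n
    ref_q = (np.arange(m) + 0.5) / m

    # Resample the token quantile function at the reference quantile levels.
    resampled = np.stack(
        [np.interp(ref_q, tok_q, tok_proj[:, s]) for s in range(num_slices)],
        axis=1,
    )                                                    # (m, L)

    # Displacement from the reference, scaled so that Euclidean distances
    # between embeddings approximate sliced-Wasserstein-2 distances.
    return ((resampled - ref_proj) / np.sqrt(m * num_slices)).ravel()

# Stand-in data (assumption: random vectors instead of real PLM outputs).
rng = np.random.default_rng(0)
dim, num_ref, num_slices = 8, 4, 3
directions = rng.normal(size=(dim, num_slices))
directions /= np.linalg.norm(directions, axis=0, keepdims=True)
reference = rng.normal(size=(num_ref, dim))

short_protein = rng.normal(size=(10, dim))   # 10 residues
long_protein = rng.normal(size=(25, dim))    # 25 residues

emb_short = swe_pool(short_protein, reference, directions)
emb_long = swe_pool(long_protein, reference, directions)
print(emb_short.shape, emb_long.shape)
```

Both proteins map to vectors of length `num_ref * num_slices` regardless of residue count, and the result does not depend on token order because each slice is sorted. Average pooling, by comparison, would reduce each protein to the single vector `tokens.mean(axis=0)`.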

Publisher

Cold Spring Harbor Laboratory

Cited by 3 articles.

1. Benchmarking text-integrated protein language model embeddings and embedding fusion on diverse downstream tasks;2024-08-26

2. Democratizing protein language models with parameter-efficient fine-tuning;Proceedings of the National Academy of Sciences;2024-06-20

3. ProteinCLIP: enhancing protein language models with natural language;2024-05-17