The geometry of hidden representations of protein language models

Published: 2022-10-26

Author:

Valeriani Lucrezia, Cuturello Francesca, Ansuini Alessio, Cazzaniga Alberto

Abstract

Protein language models (pLMs) transform their input into a sequence of hidden representations whose geometric behavior changes across layers. Looking at fundamental geometric properties such as the intrinsic dimension and the neighbor composition of these representations, we observe that these changes highlight a pattern characterized by three distinct phases. This phenomenon emerges across many models trained on diverse datasets, thus revealing a general computational strategy learned by pLMs to reconstruct missing parts of the data. These analyses show the existence of low-dimensional maps that encode evolutionary and biological properties such as remote homology and structural information. Our geometric approach sets the foundations for future systematic attempts to understand the space of protein sequences with representation learning techniques.

Publisher

Cold Spring Harbor Laboratory

Reference: 26 articles.

1. Mohammed AlQuraishi. ProteinNet: a standardized data set for machine learning of protein structure. BMC Bioinformatics, 20, 2019.

2. Alessio Ansuini, Alessandro Laio, Jakob H Macke, and Davide Zoccolan. Intrinsic dimension of data representations in deep neural networks. Advances in Neural Information Processing Systems, 32, 2019.

3. SCOPe: improvements to the structural classification of proteins – extended database to facilitate variant interpretation and machine learning.

4. N.S. Detlefsen, S. Hauberg, and W. Boomsma. Learning meaningful representations of protein sequences. Nature Communications, 13, 2022.

5. BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019.

Cited by 4 articles.

1. Diverse Genomic Embedding Benchmark for functional evaluation across the tree of life. 2024-07-16.

2. Fine-tuning protein language models boosts predictions across diverse tasks. 2023-12-14.

3. Large language models implicitly learn to straighten neural sentence trajectories to construct a predictive representation of natural language. 2023-11-06.

4. Protein family annotation for the Unified Human Gastrointestinal Proteome by DPCfam clustering. 2023-04-21.