Learning meaningful representations of protein sequences

Published: 2022-04-08 Issue: 1 Volume: 13
ISSN: 2041-1723
Container-title: Nature Communications
Language: en
Short-container-title: Nat Commun

Author:

Detlefsen Nicki Skafte, Hauberg Søren, Boomsma Wouter

Abstract

How we choose to represent our data has a fundamental impact on our ability to subsequently extract information from them. Machine learning promises to automatically determine efficient representations from large unstructured datasets, such as those arising in biology. However, empirical evidence suggests that seemingly minor changes to these machine learning models yield drastically different data representations that result in different biological interpretations of data. This begs the question of what even constitutes the most meaningful representation. Here, we approach this question for representations of protein sequences, which have received considerable attention in the recent literature. We explore two key contexts in which representations naturally arise: transfer learning and interpretable learning. In the first context, we demonstrate that several contemporary practices yield suboptimal performance, and in the latter we demonstrate that taking representation geometry into account significantly improves interpretability and lets the models reveal biological information that is otherwise obscured.

Publisher

Springer Science and Business Media LLC

Subject

General Physics and Astronomy; General Biochemistry, Genetics and Molecular Biology; General Chemistry; Multidisciplinary

Link

https://www.nature.com/articles/s41467-022-29443-w.pdf

References: 60 articles.

1. Bengio, Y., Courville, A. & Vincent, P. Representation Learning: A Review and New Perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35, 1798–1828 (2013).

2. Riesselman, A. J., Ingraham, J. B. & Marks, D. S. Deep generative models of genetic variation capture the effects of mutations. Nat. Methods 15, 816–822 (2018).

3. Bepler, T. & Berger, B. Learning protein sequence embeddings using information from structure. In International Conference on Learning Representations (2019).

4. Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019).

5. Rao, R. et al. Evaluating protein transfer learning with TAPE. In Advances in neural information processing systems 32, 9689–9701 (2019).

Cited by 80 articles.

1. SERT-StructNet: Protein secondary structure prediction method based on multi-factor hybrid deep model;Computational and Structural Biotechnology Journal;2024-12

2. Leveraging protein language model embeddings and logistic regression for efficient and accurate in-silico acidophilic proteins classification;Computational Biology and Chemistry;2024-10

3. Impact of Multi-Factor Features on Protein Secondary Structure Prediction;Biomolecules;2024-09-13

4. T-cell receptor binding prediction: A machine learning revolution;ImmunoInformatics;2024-09

5. TooT-PLM-P2S: Incorporating Secondary Structure Information into Protein Language Models;2024-08-13