Evaluating large language models for annotating proteins

Published:2024-03-27 Issue:3 Volume:25 Page:
ISSN:1467-5463
Container-title:Briefings in Bioinformatics
language:en
Author:

Vitale Rosario¹, Bugnon Leandro A¹, Fenoy Emilio Luis¹, Milone Diego H¹, Stegmayer Georgina¹

Affiliation:

1. Research Institute for Signals, Systems and Computational Intelligence sinc(i) (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina

Abstract

To date, more than 251 million proteins have been deposited in UniProtKB. However, only 0.25% have been annotated with one of the more than 15,000 possible Pfam family domains. The current annotation protocol integrates knowledge from manually curated family domains, obtained using sequence alignments and hidden Markov models. This approach has been successful for automatically growing the Pfam annotations, but at a low rate in comparison to protein discovery. Just a few years ago, deep learning models were proposed for automatic Pfam annotation. However, these models demand a considerable amount of training data, which can be a challenge with poorly populated families. To address this issue, we propose and evaluate here a novel protocol based on transfer learning. This requires the use of protein large language models (LLMs), trained with self-supervision on big unannotated datasets in order to obtain sequence embeddings. Then, the embeddings can be used with supervised learning on a small and annotated dataset for a specialized task. In this protocol we have evaluated several cutting-edge protein LLMs together with machine learning architectures to improve the current prediction of protein domain annotations. Results are significantly better than the state of the art for protein family classification, reducing the prediction error by an impressive 60% compared to standard methods. We explain how LLM embeddings can be used for protein annotation in a concrete and easy way, and provide the pipeline in a GitHub repository. Full source code and data are available at https://github.com/sinc-lab/llm4pfam
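
The two-stage idea in the abstract (a frozen embedding step followed by supervised learning on a small labeled set) can be illustrated with a minimal sketch. This is not the authors' pipeline, which is in the repository linked above. Everything here is an assumption made for illustration: `embed` stands in for a protein LLM and simply returns amino-acid composition, the classifier is a nearest-centroid rule rather than the architectures evaluated in the paper, and the family labels `FAM_A` and `FAM_B` and all sequences are invented toy data.

```python
from collections import Counter, defaultdict
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def embed(sequence):
    """Stand-in for a frozen protein LLM (assumption, not a real model).

    Returns a fixed-length vector per sequence. Here it is just the
    normalized amino-acid composition; a real protocol would instead
    pool per-residue embeddings produced by a pretrained model.
    """
    counts = Counter(sequence)
    total = sum(counts[a] for a in AMINO_ACIDS) or 1
    return [counts[a] / total for a in AMINO_ACIDS]


def fit_centroids(sequences, labels):
    """Supervised step: average the embeddings of each labeled family."""
    sums = defaultdict(lambda: [0.0] * len(AMINO_ACIDS))
    sizes = Counter()
    for seq, label in zip(sequences, labels):
        vec = embed(seq)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        sizes[label] += 1
    return {label: [s / sizes[label] for s in total]
            for label, total in sums.items()}


def predict(centroids, sequence):
    """Assign the family whose centroid is closest in embedding space."""
    vec = embed(sequence)
    return min(centroids, key=lambda label: math.dist(vec, centroids[label]))


# Toy labeled set (hypothetical families and sequences).
train_seqs = ["KKEEKKEEKDKE", "EEKKEKEKKEDK", "LLVVLLIVLLAV", "VLLIVVLLAILV"]
train_labels = ["FAM_A", "FAM_A", "FAM_B", "FAM_B"]

centroids = fit_centroids(train_seqs, train_labels)
print(predict(centroids, "KEKEKKEEKK"))
print(predict(centroids, "LLVLIVLLVV"))
```

The point of the design is that only the small second stage sees family labels. The embedding function is fixed beforehand, which is why the approach can remain usable for families with few annotated members.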

Funder

National Agency for Scientific and Technological

Agencia Santafesina de Ciencia, Tecnología e Innovación

University of Nebraska-Lincoln

Publisher

Oxford University Press (OUP)

Link

https://academic.oup.com/bib/article-pdf/25/3/bbae177/57405027/bbae177.pdf

References (35 articles)

1. UniProt: the universal protein knowledgebase in 2023;The UniProt Consortium;Nucleic Acids Res,2022

2. Basic local alignment search tool;Altschul;J Mol Biol,1990

3. The challenge of increasing Pfam coverage of the human proteome;Mistry;Database,2013

4. Pfam: the protein families database in 2021;Mistry;Nucleic Acids Res,2020

5. Using deep learning to annotate the protein universe;Bileschi;Nat Biotechnol,2022