Evaluating large language models for annotating proteins

Author:

Vitale Rosario¹, Bugnon Leandro A¹, Fenoy Emilio Luis¹, Milone Diego H¹, Stegmayer Georgina¹

Affiliation:

1. Research Institute for Signals, Systems and Computational Intelligence sinc(i) (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina

Abstract

To date, more than 251 million proteins have been deposited in UniProtKB. However, only 0.25% have been annotated with one of the more than 15000 possible Pfam family domains. The current annotation protocol integrates knowledge from manually curated family domains, obtained using sequence alignments and hidden Markov models. This approach has been successful for automatically growing the Pfam annotations, but at a low rate in comparison to protein discovery. Just a few years ago, deep learning models were proposed for automatic Pfam annotation. However, these models demand a considerable amount of training data, which can be a challenge with poorly populated families. To address this issue, we propose and evaluate here a novel protocol based on transfer learning. This requires the use of protein large language models (LLMs), trained with self-supervision on big unannotated datasets in order to obtain sequence embeddings. Then, the embeddings can be used with supervised learning on a small annotated dataset for a specialized task. In this protocol we have evaluated several cutting-edge protein LLMs together with machine learning architectures to improve the prediction of protein domain annotations. Results are significantly better than the state of the art for protein family classification, reducing the prediction error by an impressive 60% compared to standard methods. We explain how LLM embeddings can be used for protein annotation in a concrete and easy way, and provide the pipeline in a GitHub repository. Full source code and data are available at https://github.com/sinc-lab/llm4pfam
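The two-stage idea described in the abstract (a frozen embedding step followed by supervised learning on a small labeled set) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper or its repository: `embed` is a hypothetical stand-in that returns amino-acid frequencies instead of calling a pre-trained protein LLM, the nearest-centroid classifier replaces whatever architectures the authors evaluated, and the sequences and the labels `family_A` and `family_B` are toy data rather than real Pfam entries.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def embed(sequence):
    """Hypothetical stand-in for a frozen protein LLM.

    A real pipeline would return the model's sequence embedding; here we
    return a 20-dimensional amino-acid frequency vector so the example
    runs without downloading any model.
    """
    counts = Counter(sequence)
    total = max(len(sequence), 1)
    return [counts.get(aa, 0) / total for aa in AMINO_ACIDS]


def fit_centroids(sequences, labels):
    """Supervised step: average the embeddings of each labeled family."""
    sums = {}
    counts = Counter()
    for seq, label in zip(sequences, labels):
        vec = embed(seq)
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict(centroids, sequence):
    """Assign the family whose centroid is closest in embedding space."""
    vec = embed(sequence)
    return min(centroids, key=lambda label: math.dist(vec, centroids[label]))


# Toy labeled set (made-up sequences and family names, not real Pfam data).
train_sequences = ["AAAAGAAA", "AAGAAAAA", "KKKKEKKK", "KEKKKKKK"]
train_labels = ["family_A", "family_A", "family_B", "family_B"]

centroids = fit_centroids(train_sequences, train_labels)
prediction_a = predict(centroids, "AAAAAGGA")
prediction_b = predict(centroids, "KKEEKKKK")
print(prediction_a, prediction_b)
```

The embedding function is never trained on the labeled data, which is the point of the transfer-learning setup: only the small classifier on top sees the annotations, so families with few labeled members can still be handled. Swapping `embed` for a real protein LLM would leave `fit_centroids` and `predict` unchanged.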

Funder

National Agency for Scientific and Technological

Agencia Santafesina de Ciencia, Tecnología e Innovación

University of Nebraska-Lincoln

Publisher

Oxford University Press (OUP)

