Homology detection using a protein secondary structure-based large language model

Author:

Kogay Roman, Ma Weicheng, Bousselham Jad, Yang Zechen, Rockmore Daniel, Zhaxybayeva Olga, Vosoughi Soroush

Abstract

Detection of homology among proteins is fundamental to understanding protein function. Unfortunately, traditional homology searches using amino acid sequence similarity are limited when numerous amino acid substitutions have accumulated, either due to billions of years of evolution or through processes of accelerated change. Recent applications of deep-learning approaches demonstrate that “protein language” models of amino acid sequences can improve the accuracy of traditional homology searches. Ultimately, the ability to work seamlessly with tertiary structures of proteins will solve the homology detection challenge and provide accompanying insights directly related to function, but to date the use of 3D structures suffers from both limited data availability and computational bottlenecks. Herein, we present the Protein Secondary Structure Language (ProSSL) model, an efficient encoding of protein secondary structure information in a Transformer-based deep-learning architecture. We conjecture that protein secondary structure, which is better conserved than primary sequences and much more easily predictable and available than tertiary protein structure, could aid in the task of homology detection. ProSSL has the computational advantages of primary sequence-based homology detection, while also providing important structural information for similarity scoring. Using two case studies of large, diverse viral protein families, we show that the ProSSL model successfully captures patterns of secondary structure arrangements and is effective in detecting homologs either as a pre-trained or fine-tuned model. In both tasks, we accurately detect members of these protein families, including those missed in traditional amino acid similarity searches. We also illustrate how functional insights from the individual ProSSL models can be obtained using Shapley Additive exPlanations (SHAP) values.

Author Summary

When DNA is obtained from an organism or an environment, scientists are tasked with determining the functions of the proteins encoded in the genetic material. Such “functional annotation” relies on assigning functions based on the similarity of the proteins to counterparts in databases comprising annotated sequence data. Especially challenging is recognizing similarity in proteins that have accumulated many amino acid changes. It is well known that the spatial structure of proteins that share ancestry and perform similar functions evolves much more slowly than the sequences of the proteins’ amino acids. Thus, comparison of 3D structures could address this challenge, but the data is still limited to certain classes of proteins, and the requisite computations are expensive. Herein we present a deep-learning model derived from protein secondary structure representation, a symbolic encoding of the way neighboring amino acid residues of a protein interact with each other. Unlike 3D structure, the secondary structure is quickly and accurately predictable from amino acid sequences of proteins. Using two viral proteins as case studies, we demonstrate that our model works well for detection of protein similarity, including identification of very distantly related proteins.

Publisher

Cold Spring Harbor Laboratory
