Improved structure-related prediction for insufficient homologous proteins using MSA enhancement and pre-trained language model

Published:2023-06-15 Issue:4 Volume:24 Page:
ISSN:1467-5463
Container-title:Briefings in Bioinformatics
language:en
Short-container-title:

Author:

Meng Qiaozhen¹, Guo Fei², Tang Jijun³

Affiliation:

1. School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China

2. School of Computer Science and Engineering, Central South University, Changsha 410083, China

3. Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518000, China

Abstract

In recent years, protein structure problems have become a hotspot for understanding protein folding and function mechanisms. Most protein structure methods rely on, and benefit from, co-evolutionary information obtained from multiple sequence alignments (MSAs). AlphaFold2 (AF2), for example, is a typical MSA-based protein structure tool known for its high accuracy. As a consequence, such MSA-based methods are limited by the quality of their MSAs. AlphaFold2 performs unsatisfactorily as MSA depth decreases, especially for orphan proteins that have no homologous sequences, which may pose a barrier to its widespread application in protein mutation and design problems, where rich homologous sequences are unavailable and rapid prediction is needed. In this paper, we constructed two standard datasets, Orphan62 and Design204, for orphan and de novo proteins, respectively, which have insufficient or no homology information, in order to fairly evaluate the performance of various methods in this setting. Then, depending on whether scarce MSA information is used, we summarized two approaches, MSA-enhanced and MSA-free methods, for effectively addressing the problem of insufficient MSAs. MSA-enhanced models aim to improve poor MSA quality at the data source through knowledge distillation and generative models. MSA-free models learn the relationships between residues directly from enormous numbers of protein sequences via pre-trained models, bypassing the step of extracting the residue-pair representation from an MSA. Next, we evaluated four MSA-free methods (trRosettaX-Single, TRFold, ESMFold and ProtT5) and one MSA-enhanced method (Bagging MSA) against a traditional MSA-based method, AlphaFold2, on two protein structure-related prediction tasks. Comparative analyses show that the MSA-free methods trRosettaX-Single and ESMFold achieve fast prediction ($\sim 40$ s) and performance comparable to AF2 in tertiary structure prediction, especially for short peptides, $\alpha$-helical segments and targets with few homologous sequences. In secondary structure prediction, Bagging MSA, which uses MSA enhancement, improves the accuracy of our trained MSA-based base model when homology information is poor. Our study gives biologists insight into how to select rapid and appropriate prediction tools for enzyme engineering and peptide drug development.

Contact: guofei@csu.edu.cn, jj.tang@siat.ac.cn
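The abstract ties prediction quality to MSA depth without defining the term. As a rough illustration only, the sketch below counts the sequences in a toy aligned-FASTA string and computes a simple effective depth by down-weighting near-duplicate sequences. The function names, the 0.8 identity threshold and the toy alignment are assumptions made for this example; they are not taken from the paper, which may define depth differently.

```python
def read_aligned_fasta(text):
    """Parse aligned FASTA text into a list of equal-length sequences."""
    seqs, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header closes the previous record, if there was one.
            if current:
                seqs.append("".join(current))
                current = []
        else:
            current.append(line)
    if current:
        seqs.append("".join(current))
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("sequences must be aligned to the same length")
    return seqs


def identity(a, b):
    """Fraction of columns with the same symbol (gaps are compared like residues)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def msa_depth(seqs):
    """Raw depth: the number of sequences in the alignment, query included."""
    return len(seqs)


def effective_depth(seqs, threshold=0.8):
    """Weight each sequence by 1 / (number of sequences at or above the
    identity threshold to it, itself included) and sum the weights.
    The threshold value is an assumption for this sketch."""
    total = 0.0
    for a in seqs:
        neighbours = sum(identity(a, b) >= threshold for b in seqs)
        total += 1.0 / neighbours
    return total


# Hypothetical toy alignment: a query, two near-identical hits, one distant hit.
TOY_MSA = """
>query
ACDEFGHIKL
>hit1
ACDEFGHIKL
>hit2
ACDEFGHIKV
>hit3
MNPQRSTVWY
"""

if __name__ == "__main__":
    seqs = read_aligned_fasta(TOY_MSA)
    print("depth:", msa_depth(seqs))
    print("effective depth:", effective_depth(seqs))
```

In this toy case the three near-identical sequences collapse into roughly one effective sequence, so the effective depth is lower than the raw count. An orphan protein would sit at the extreme where the alignment contains little more than the query itself.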

Funder

National Key Research and Development Program of China

National Natural Science Foundation of China

Excellent Young Scientists Fund in Hunan Province

Scientific Research Fund of Hunan Provincial Education Department

Zhejiang Lab Open Research Project

Shenzhen Science and Technology Program

High Performance Computing Center of Central South University

Publisher

Oxford University Press (OUP)

Subject

Molecular Biology, Information Systems