Accurately identifying nucleic-acid-binding sites through geometric graph learning on language model predicted structures

Published: 2023-09-22 Issue: 6 Volume: 24
ISSN: 1467-5463
Container-title: Briefings in Bioinformatics
Language: en

Author:

Song Yidong¹, Yuan Qianmu¹, Zhao Huiying¹, Yang Yuedong¹

Affiliation:

1. Key Laboratory of Machine Intelligence and Advanced Computing of MOE, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510000, China

Abstract

The interactions between nucleic acids and proteins are important in diverse biological processes, yet high-quality prediction of nucleic-acid-binding sites remains a significant challenge. Sequence-based methods are limited because they consider only sequence context, whereas structure-based methods cannot be applied to proteins without known tertiary structures. Although structures predicted by AlphaFold2 could be used, its heavy computational requirements hinder genome-wide application. Building on the recent breakthrough of ESMFold for fast protein structure prediction, we have developed GLMSite, which accurately identifies DNA- and RNA-binding sites using geometric graph learning on ESMFold-predicted structures. The predicted structures are used to construct a protein structural graph with residues as nodes and spatially neighboring residue pairs as edges. The node representations are further enhanced with the pre-trained language model ProtTrans. The network was trained using a geometric vector perceptron, and the resulting geometric embeddings were fed into a common network to learn common binding characteristics. Finally, these characteristics were passed to two fully connected layers that predict DNA- and RNA-binding sites, respectively. In comprehensive tests on DNA/RNA benchmark datasets, GLMSite surpassed the latest sequence-based methods and was comparable with structure-based methods. Moreover, the predictions were shown to be useful for inferring nucleic-acid-binding proteins, demonstrating the method's potential for protein function discovery. The datasets, code, and trained models are available at https://github.com/biomed-AI/nucleic-acid-binding.
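
The graph construction described in the abstract (residues as nodes, spatially neighboring residue pairs as edges) can be illustrated with a minimal sketch. This is not the GLMSite implementation: the function name `build_residue_graph`, the use of C-alpha coordinates, and the 10 Å distance cutoff are all assumptions made for illustration, since the abstract does not state how spatial neighbors are defined.

```python
from math import dist
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def build_residue_graph(
    ca_coords: Sequence[Coord], cutoff: float = 10.0
) -> List[Tuple[int, int]]:
    """Return undirected edges (i, j) with i < j.

    Two residues are connected when their C-alpha atoms lie within
    `cutoff` angstroms. Both the C-alpha choice and the cutoff value
    are illustrative assumptions, not taken from the paper.
    """
    edges = []
    for i in range(len(ca_coords)):
        for j in range(i + 1, len(ca_coords)):
            if dist(ca_coords[i], ca_coords[j]) <= cutoff:
                edges.append((i, j))
    return edges


# Toy input: four residues on a line, 6 angstroms apart (made-up coordinates).
toy_coords = [
    (0.0, 0.0, 0.0),
    (6.0, 0.0, 0.0),
    (12.0, 0.0, 0.0),
    (18.0, 0.0, 0.0),
]
toy_edges = build_residue_graph(toy_coords, cutoff=10.0)
print(toy_edges)
```

In a real pipeline the coordinates would come from an ESMFold-predicted structure file, and each node would also carry features such as language-model embeddings. The all-pairs loop is quadratic in the number of residues, which is acceptable for a sketch but would normally be replaced by a spatial index for large proteins.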

Funder

National Key Research and Development Program of China

National Natural Science Foundation of China

Publisher

Oxford University Press (OUP)

Subject

Molecular Biology, Information Systems

Link

https://academic.oup.com/bib/article-pdf/24/6/bbad360/52011661/bbad360.pdf
