Abstract
The interactions between nucleic acids and proteins are important in diverse biological processes. The high-quality prediction of nucleic-acid-binding sites continues to pose a significant challenge. Presently, the predictive efficacy of sequence-based methods is constrained by their exclusive consideration of sequence context information, whereas structure-based methods are unsuitable for proteins lacking known tertiary structures. Though protein structures predicted by AlphaFold2 could be used, the extensive computing requirement of AlphaFold2 hinders its use for genome-wide applications. Based on the recent breakthrough of ESMFold for fast prediction of protein structures, we have developed GLMSite, which accurately identifies DNA- and RNA-binding sites using geometric graph learning on ESMFold-predicted structures. Here, the predicted protein structures are employed to construct a protein structural graph with residues as nodes and spatially neighboring residue pairs as edges. The node representations are further enhanced through the pre-trained language model ProtTrans. The network was trained using a geometric vector perceptron, and the geometric embeddings were subsequently fed into a common network to acquire common binding characteristics. Then two fully connected layers were employed to learn specific binding patterns for DNA and RNA, respectively. Through comprehensive tests on DNA/RNA benchmark datasets, GLMSite was shown to surpass the latest sequence-based methods and to be comparable with structure-based methods. Moreover, the predictions were shown to be useful for the inference of nucleic-acid-binding proteins, demonstrating the method's potential for protein function discovery. The datasets, code, and trained models are available at https://github.com/biomed-AI/nucleic-acid-binding.
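The graph construction described in the abstract (residues as nodes, spatially neighboring residue pairs as edges) can be illustrated with a minimal sketch. This is not the GLMSite implementation: the function name `build_residue_graph`, the use of one C-alpha coordinate per residue, and the distance cutoff are all assumptions made for illustration, since the abstract does not state how spatial neighbors are defined.

```python
import math
from typing import List, Tuple

Coord = Tuple[float, float, float]


def build_residue_graph(ca_coords: List[Coord], cutoff: float) -> List[Tuple[int, int]]:
    """Return directed edges (i, j) between residues whose representative
    atoms lie within `cutoff` angstroms of each other.

    Assumption: each residue is represented by a single coordinate
    (for example its C-alpha atom); the cutoff value is hypothetical.
    """
    edges = []
    for i, a in enumerate(ca_coords):
        for j, b in enumerate(ca_coords):
            # Skip self-loops; keep pairs that are spatially close.
            if i != j and math.dist(a, b) <= cutoff:
                edges.append((i, j))
    return edges


# Toy coordinates: three residues along a line plus one distant residue.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (30.0, 0.0, 0.0)]
edges = build_residue_graph(coords, cutoff=8.0)
print(edges)
```

In this toy example the first three residues are mutually connected and the distant fourth residue has no edges. In practice the coordinates would come from a predicted structure, and each node would additionally carry features such as language-model embeddings.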
Publisher
Cold Spring Harbor Laboratory