Diverse Database and Machine Learning Model to Narrow the Generalization Gap in RNA Structure Prediction

Published: 2024-01-25

Author:

de Lajarte Albéric A., Martin des Taillades Yves J., Kalicki Colin, Fuchs Wightman Federico, Aruda Justin, Salazar Dragui, Allan Matthew F., L’Esperance-Kerckhoff Casper, Kashi Alex, Jossinet Fabrice, Rouskin Silvi

Abstract

Understanding macromolecular structures of proteins and nucleic acids is critical for discerning their functions and biological roles. Advanced techniques (crystallography, NMR, and cryo-EM) have facilitated the determination of over 180,000 protein structures, all cataloged in the Protein Data Bank (PDB). This comprehensive repository has been pivotal in developing deep learning algorithms for predicting protein structures directly from sequences. In contrast, RNA structure prediction has lagged and suffers from a scarcity of structural data. Here, we present the secondary structure models of 1098 pri-miRNAs and 1456 human mRNA regions determined through chemical probing. We develop a novel deep learning architecture, inspired by the Evoformer model of AlphaFold and traditional architectures for secondary structure prediction. This new model, eFold, was trained on our newly generated database and over 300,000 secondary structures across multiple sources. We benchmark eFold on two new test sets of long and diverse RNA structures and show that our dataset and new architecture contribute to increasing the prediction performance, compared to similar state-of-the-art methods. Altogether, our results reveal that merely expanding the database size is insufficient for generalization across families, whereas incorporating a greater diversity and complexity of RNA structures allows for enhanced model performance.

Publisher

Cold Spring Harbor Laboratory
