Abstract
Modern proteins did not arise abruptly, as singular events, but rather over the course of at least 3.5 billion years of evolution. Can machine learning teach us how this occurred? The molecular evolutionary processes that yielded the intricate three-dimensional (3D) structures of proteins involve duplication, recombination and mutation of genetic elements, corresponding to short peptide fragments. Identifying and elucidating these ancestral fragments is crucial to deciphering the interrelationships amongst proteins, as well as how evolution acts upon protein sequences, structures and functions. Traditionally, structural fragments have been found using sequence-based and 3D structural alignment approaches, but this becomes challenging when proteins have undergone extensive permutations, which allow two proteins to share a common architecture even though their topologies may differ drastically (a phenomenon termed the Urfold). We have designed a new framework to identify compact, potentially discontinuous peptide fragments by combining (i) deep generative models of protein superfamilies with (ii) layerwise relevance propagation (LRP) to identify atoms of great relevance in creating an embedding during an all superfamilies × all domains analysis. Our approach recapitulates known relationships amongst the evolutionarily ancient small β-barrels (e.g. SH3 and OB folds) and amongst P-loop–containing proteins (e.g. Rossmann and P-loop NTPases), previously established via manual analysis. Because of the generality of our deep model’s approach, we anticipate that it can enable the discovery of new ancestral peptides.
In a sense, our framework uses LRP as an ‘explainable AI’ approach, in conjunction with a recent deep generative model of protein structure (termed DeepUrfold), in order to leverage decades’ worth of structural biology knowledge to decipher the underlying molecular bases for protein structural relationships, including those which are exceedingly remote, yet discoverable via deep learning.
Publisher
Cold Spring Harbor Laboratory