Tree visualizations of protein sequence embedding space enable improved functional clustering of diverse protein superfamilies

Author:

Yeung Wayland (1), Zhou Zhongliang (2), Mathew Liju (3), Gravel Nathan (1), Taujale Rahil (1), O’Boyle Brady (4), Salcedo Mariah (4), Venkat Aarya (4), Lanzilotta William (4), Li Sheng (5), Kannan Natarajan (1, 4)

Affiliation:

1. Institute of Bioinformatics, University of Georgia, 30602, Georgia, USA

2. School of Computing, University of Georgia, 30602, Georgia, USA

3. Department of Microbiology, University of Georgia, 30602, Georgia, USA

4. Department of Biochemistry and Molecular Biology, University of Georgia, 30602, Georgia, USA

5. School of Data Science, University of Virginia, 22903, Virginia, USA

Abstract

Protein language models, trained on millions of biologically observed sequences, generate feature-rich numerical representations of protein sequences. These representations, called sequence embeddings, can be used to infer structural and functional properties, even though protein language models are trained on primary sequence alone. While sequence embeddings have been applied to tasks such as structure and function prediction, applications to alignment-free sequence classification have been hindered by a lack of studies that derive, quantify, and evaluate relationships between protein sequence embeddings. Here, we develop workflows and visualization methods for the classification of protein families using sequence embeddings derived from protein language models. A benchmark of manifold visualization methods reveals that Neighbor Joining (NJ) embedding trees are highly effective at capturing global structure, while performing comparably to popular dimensionality reduction techniques such as t-SNE and UMAP at capturing local structure. The statistical significance of hierarchical clusters on a tree is evaluated by resampling embeddings using a variational autoencoder (VAE). We demonstrate the application of our methods in the classification of two well-studied enzyme superfamilies, phosphatases and protein kinases. Our embedding-based classifications remain consistent with, and extend, previously published sequence alignment-based classifications. We also propose a new hierarchical classification for the S-Adenosyl-L-Methionine (SAM) enzyme superfamily, which has been difficult to classify using traditional alignment-based approaches. Beyond applications in sequence classification, our results further suggest that NJ trees are a promising general method for visualizing high-dimensional data sets.
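
The idea of an NJ embedding tree mentioned in the abstract can be illustrated with a minimal sketch: compute pairwise distances between embedding vectors, then repeatedly join the pair of nodes that minimizes the standard Neighbor Joining Q criterion. Everything specific below is an assumption made for illustration and is not taken from the paper: the Euclidean distance, the function name `neighbor_joining`, and the toy 2-D vectors with hypothetical labels (real protein language model embeddings have hundreds or thousands of dimensions).

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (an assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def neighbor_joining(vectors):
    """Build an unrooted Neighbor Joining tree from {name: vector}.

    Returns a Newick string. Internal nodes are labelled by their own
    Newick sub-string, which keeps the bookkeeping short.
    """
    nodes = list(vectors)
    # Pairwise distances, keyed by an unordered pair of node labels.
    dist = {
        frozenset((a, b)): euclidean(vectors[a], vectors[b])
        for a, b in combinations(nodes, 2)
    }

    def d(a, b):
        return 0.0 if a == b else dist[frozenset((a, b))]

    while len(nodes) > 2:
        n = len(nodes)
        # Sum of distances from each node to all other active nodes.
        total = {a: sum(d(a, b) for b in nodes) for a in nodes}

        # Q criterion: pick the pair with the smallest
        # (n - 2) * d(i, j) - total[i] - total[j].
        def q(pair):
            i, j = pair
            return (n - 2) * d(i, j) - total[i] - total[j]

        i, j = min(combinations(nodes, 2), key=q)

        # Branch lengths from the new internal node to i and to j.
        len_i = 0.5 * d(i, j) + (total[i] - total[j]) / (2 * (n - 2))
        len_j = d(i, j) - len_i
        new = f"({i}:{len_i:.3f},{j}:{len_j:.3f})"

        # Distance from the new node to every remaining node.
        for k in nodes:
            if k not in (i, j):
                dist[frozenset((new, k))] = 0.5 * (d(i, k) + d(j, k) - d(i, j))

        nodes = [k for k in nodes if k not in (i, j)] + [new]

    # Connect the last two nodes with a single branch.
    a, b = nodes
    return f"({a}:{d(a, b):.3f},{b});"


# Hypothetical toy "embeddings": two tight pairs that sit far apart.
toy = {
    "seqA": (0.0, 0.0),
    "seqB": (0.0, 1.0),
    "seqC": (5.0, 0.0),
    "seqD": (6.0, 2.0),
}

newick = neighbor_joining(toy)
print(newick)
```

The resulting Newick string can be loaded into any standard tree viewer. Because the tree is unrooted, the exact nesting of the string may vary with tie-breaking, while the underlying topology (seqA with seqB, seqC with seqD) stays the same.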

Funder

National Institutes of Health

Publisher

Oxford University Press (OUP)

Subject

Molecular Biology, Information Systems

