An explainable unsupervised framework for alignment-free protein classification using sequence embeddings

Published: 2022-02-10

Author:

Yeung Wayland, Zhou Zhongliang, Mathew Liju, Gravel Nathan, Taujale Rahil, Venkat Aarya, Lanzilotta William, Li Sheng, Kannan Natarajan

Abstract

Protein classification is a cornerstone of biology that relies heavily on alignment-based comparison of primary sequences. However, the systematic classification of large protein superfamilies is impeded by unique challenges in aligning divergent sequence datasets. We developed an alignment-free approach for sequence analysis and classification using embedding vectors generated from pre-trained protein language models, which capture underlying protein structural-functional properties through unsupervised training on millions of biologically observed sequences. We constructed embedding-based trees (with branch support) that depict hierarchical clustering of protein sequences, and we infer fast- and slow-evolving sites through interpretable sequence projections. Applied to diverse protein superfamilies, embedding trees infer Casein Kinase 1 (CK1) as the basal protein kinase clade, identify convergent functional motifs shared between divergent phosphatase folds, and infer evolutionary relationships between diverse radical S-Adenosyl-L-Methionine (SAM) enzyme families. Overall, the results indicate that embedding trees effectively capture global data structures and can function as a general unsupervised approach for visualizing high-dimensional manifolds.
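
The idea of building a tree from embedding vectors can be illustrated with a minimal sketch. It assumes average-linkage agglomerative clustering on cosine distance, which is a generic stand-in and not necessarily the authors' tree-building procedure; branch support is not reproduced. The three-dimensional vectors and the sequence labels are hypothetical toy values standing in for real protein language model embeddings.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def embedding_tree(embeddings):
    """Average-linkage agglomerative clustering (assumed method).

    `embeddings` maps a sequence label to its embedding vector.
    Returns the tree as nested 2-tuples of labels.
    """
    # Each cluster is (subtree, list of member labels).
    clusters = [(name, [name]) for name in embeddings]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            members_i, members_j = clusters[i][1], clusters[j][1]
            # Average pairwise distance between the two clusters.
            total = sum(
                cosine_distance(embeddings[a], embeddings[b])
                for a in members_i
                for b in members_j
            )
            dist = total / (len(members_i) * len(members_j))
            if best is None or dist < best[0]:
                best = (dist, i, j)
        _, i, j = best
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        # Replace the two closest clusters with their merge.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Hypothetical toy embeddings; real ones have hundreds of dimensions.
toy_embeddings = {
    "kinase_A": [0.9, 0.1, 0.0],
    "kinase_B": [0.8, 0.2, 0.1],
    "phosphatase_A": [0.1, 0.9, 0.2],
    "phosphatase_B": [0.0, 0.8, 0.3],
}

tree = embedding_tree(toy_embeddings)
print(tree)
```

With these toy vectors the two kinase-like entries and the two phosphatase-like entries group into separate subtrees, since no sequence alignment is needed to compare fixed-length embeddings.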

Publisher

Cold Spring Harbor Laboratory
