Survey of Protein Sequence Embedding Models

Published: 2023-02-14
Container-title: International Journal of Molecular Sciences
Short-container-title: IJMS
Volume: 24, Issue: 4, Page: 3775
ISSN: 1422-0067
Language: en

Authors:
Tran, Chau (1); Khadkikar, Siddharth (2); Porollo, Aleksey (3,4,5) (ORCID)

Affiliations:
1. Department of Computer Science, University of Cincinnati, Cincinnati, OH 45219, USA
2. Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, OH 44106, USA
3. Center for Autoimmune Genomics and Etiology, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH 45229, USA
4. Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH 45229, USA
5. Department of Pediatrics, University of Cincinnati, Cincinnati, OH 45267, USA
Abstract
Derived from natural language processing (NLP) algorithms, protein language models enable the encoding of protein sequences, which vary widely in length and amino acid composition, as fixed-size numerical vectors (embeddings). We surveyed representative embedding models, such as ESM, ESM-1b, ProtT5, and SeqVec, along with their derivatives (goPredSim and PLAST), on the following tasks in computational biology: embedding the Saccharomyces cerevisiae proteome, gene ontology (GO) annotation of the uncharacterized proteins of this organism, relating variants of human proteins to disease status, correlating mutants of beta-lactamase TEM-1 from Escherichia coli with experimentally measured antimicrobial resistance, and analyzing diverse fungal mating factors. We discuss the advances and shortcomings, differences, and concordance of the models. Of note, all of the models revealed that the uncharacterized proteins in yeast tend to be shorter than 200 amino acids, contain fewer aspartates and glutamates, and are enriched in cysteine. Fewer than half of these proteins can be annotated with GO terms with high confidence. The distributions of the cosine similarity scores of benign and pathogenic mutations relative to the reference human proteins differ with statistical significance. The differences between the embeddings of the reference TEM-1 and its mutants show low to no correlation with minimal inhibitory concentrations (MICs).
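The abstract's core operation, comparing a variant protein to its reference via cosine similarity of fixed-size embeddings, can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the mean-pooling step and the 1280-dimensional vectors (the per-residue embedding size of ESM-1b) stand in for whatever embeddings a real model would produce, and the random arrays are placeholders for actual model output.

```python
import numpy as np

def pool_embedding(per_residue: np.ndarray) -> np.ndarray:
    """Mean-pool per-residue embeddings (L x D) into one fixed-size vector (D,).

    This is one common way to get a length-independent protein embedding;
    models differ in how (or whether) they pool.
    """
    return per_residue.mean(axis=0)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder data: in practice these arrays would come from a protein
# language model (e.g., one 1280-dim vector per residue for a 286-aa protein
# such as TEM-1 beta-lactamase).
rng = np.random.default_rng(0)
reference = pool_embedding(rng.normal(size=(286, 1280)))
mutant = pool_embedding(rng.normal(size=(286, 1280)))

score = cosine_similarity(reference, mutant)
print(f"cosine similarity (reference vs. mutant): {score:.3f}")
```

Repeating this over sets of benign and pathogenic variants yields two score distributions whose separation can then be tested for statistical significance, which is the kind of comparison the abstract describes.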
Funder: National Institutes of Health
Subjects: Inorganic Chemistry, Organic Chemistry, Physical and Theoretical Chemistry, Computer Science Applications, Spectroscopy, Molecular Biology, General Medicine, Catalysis
Cited by: 8 articles