Evaluation of Methods for Protein Representation Learning: A Quantitative Analysis

Published: 2020-10-28

Author:

Unsal Serbulent, Ataş Heval, Albayrak Muammer, Turhan Kemal, Acar Aybar C., Doğan Tunca

Abstract

Data-centric approaches have been utilized to develop predictive methods for elucidating uncharacterized aspects of proteins such as their functions, biophysical properties, subcellular locations and interactions. However, studies indicate that the performance of these methods should be further improved to effectively solve complex problems in biomedicine and biotechnology. A data representation method can be defined as an algorithm that calculates numerical feature vectors for samples in a dataset, to be later used in quantitative modelling tasks. Data representation learning methods do this by training and using a model that employs statistical and machine/deep learning algorithms. These novel methods mostly take inspiration from the data-driven language models that have yielded ground-breaking improvements in the field of natural language processing. Lately, these learned data representations have been applied to the field of protein informatics and have displayed highly promising results in terms of extracting complex traits of proteins regarding sequence-structure-function relations. In this study, we conducted a detailed investigation of protein representation learning methods, by first categorizing and explaining each approach, and then conducting benchmark analyses on: (i) inferring semantic similarities between proteins, (ii) predicting ontology-based protein functions, and (iii) classifying drug target protein families. We examine the advantages and disadvantages of each representation approach based on the benchmark results. Finally, we discuss current challenges and suggest future directions. We believe the conclusions of this study will help researchers in applying machine/deep learning-based representation techniques on protein data for various types of predictive tasks. Furthermore, we hope it will demonstrate the potential of machine learning-based data representations for protein science and inspire the development of novel methods/tools to be utilized in the fields of biomedicine and biotechnology.
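
To make the abstract's definition of a data representation method concrete (an algorithm that maps each sample to a numerical feature vector), the following is a minimal Python sketch. It is a hand-crafted illustration only: it is not a learned representation and is not taken from the paper. The function names `composition_vector` and `cosine_similarity` and the toy sequences are assumptions introduced here.

```python
import math
from collections import Counter

# The 20 standard amino acids, in a fixed order that defines the vector layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(sequence):
    """Map a protein sequence to a 20-dimensional vector of residue frequencies.

    Hypothetical helper for illustration; non-standard residue letters are ignored.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Toy sequences invented for this example, not real benchmark data.
    vec_a = composition_vector("MKTAYIAKQR")
    vec_b = composition_vector("MKTAYLAKQR")
    print(len(vec_a), cosine_similarity(vec_a, vec_b))
```

Once every protein is a fixed-length vector, the same vectors can feed a similarity comparison or a standard classifier. A learned representation would replace `composition_vector` with the output of a trained model.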

Publisher

Cold Spring Harbor Laboratory

References: 132 articles.

1. ECPred: a tool for the prediction of the enzymatic functions of protein sequences based on the EC nomenclature

2. Distinguishing Enzyme Structures from Non-enzymes Without Alignments

3. Assignment of EC Numbers to Enzymatic Reactions with MOLMAP Reaction Descriptors and Random Forests

4. Asgari, E. & Mofrad, M. R. K. Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics. PLoS One 10, e0141287 (2015).

5. Kimothi, D., Soni, A., Biyani, P. & Hogan, J. M. Distributed Representations for Biological Sequence Analysis. arXiv [cs.LG] (2016).

Cited by 5 articles.

1. TooT-PLM-P2S: Incorporating Secondary Structure Information into Protein Language Models;2024-08-13

2. Exploiting protein language models for the precise classification of ion channels and ion transporters;Proteins: Structure, Function, and Bioinformatics;2024-04-24

3. Exploiting protein language models for the precise classification of ion channels and ion transporters;2023-07-12

4. Protein design via deep learning;Briefings in Bioinformatics;2022-03-25

5. Representation learning applications in biological sequence analysis;2021-02-27