Exploring the limitations of biophysical propensity scales coupled with machine learning for protein sequence analysis-Reference-Cited by-同舟云学术

Exploring the limitations of biophysical propensity scales coupled with machine learning for protein sequence analysis

Published:2019-11-15 Issue:1 Volume:9 Page:
ISSN:2045-2322
Container-title:Scientific Reports
language:en
Short-container-title:Sci Rep

Author:

Raimondi Daniele,Orlando Gabriele,Vranken Wim F.,Moreau Yves

Abstract

AbstractMachine learning (ML) is ubiquitous in bioinformatics, due to its versatility. One of the most crucial aspects to consider while training a ML model is to carefully select the optimal feature encoding for the problem at hand. Biophysical propensity scales are widely adopted in structural bioinformatics because they describe amino acids properties that are intuitively relevant for many structural and functional aspects of proteins, and are thus commonly used as input features for ML methods. In this paper we reproduce three classical structural bioinformatics prediction tasks to investigate the main assumptions about the use of propensity scales as input features for ML methods. We investigate their usefulness with different randomization experiments and we show that their effectiveness varies among the ML methods used and the tasks. We show that while linear methods are more dependent on the feature encoding, the specific biophysical meaning of the features is less relevant for non-linear methods. Moreover, we show that even among linear ML methods, the simpler one-hot encoding can surprisingly outperform the “biologically meaningful” scales. We also show that feature selection performed with non-linear ML methods may not be able to distinguish between randomized and “real” propensity scales by properly prioritizing to the latter. Finally, we show that learning problem-specific embeddings could be a simple, assumptions-free and optimal way to perform feature learning/engineering for structural bioinformatics tasks.

Funder

Fonds Wetenschappelijk Onderzoek

KU Leuven

Publisher

Springer Science and Business Media LLC

Subject

Multidisciplinary

Link

http://www.nature.com/articles/s41598-019-53324-w.pdf

Reference40 articles.

1. Buchan, D. W., Minneci, F., Nugent, T. C., Bryson, K. & Jones, D. T. Scalable web services for the PSIPRED Protein Analysis Workbench. Nucleic acids research. 41(W1), W349–W357 (2013).

2. Pollastri, G., Przybylski, D., Rost, B. & Baldi, P. Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins: Structure, Function, and Bioinformatics. 47(2), 228–235 (2002).