Redundancy-weighting the PDB for detailed secondary structure prediction using deep-learning models

Published: 2020-03-18 Issue: 12 Volume: 36 Page: 3733-3738
ISSN: 1367-4803
Container-title: Bioinformatics
Language: en

Author:

Sidi Tomer¹, Keasar Chen¹

Affiliation:

1. Department of Computer Science, Ben-Gurion University, P.O.B 653, Be'er Sheva 84105, Israel

Abstract

Motivation: The Protein Data Bank (PDB), the ultimate source for data in structural biology, is inherently imbalanced. To alleviate biases, virtually all structural biology studies use nonredundant (NR) subsets of the PDB, which include only a fraction of the available data. An alternative approach, dubbed redundancy-weighting (RW), down-weights redundant entries rather than discarding them. This approach may be particularly helpful for machine-learning (ML) methods that use the PDB as their source for data. Methods for secondary structure prediction (SSP) have greatly improved over the years with recent studies achieving above 70% accuracy for eight-class (DSSP) prediction. As these methods typically incorporate ML techniques, training on RW datasets might improve accuracy, as well as pave the way toward larger and more informative secondary structure classes.

Results: This study compares the SSP performances of deep-learning models trained on either RW or NR datasets. We show that training on RW sets consistently results in better prediction of 3- (HCE), 8- (DSSP) and 13-class (STR2) secondary structures.

Availability and implementation: The ML models, the datasets used for their derivation and testing, and a stand-alone SSP program for DSSP and STR2 predictions, are freely available under LGPL license in http://meshi1.cs.bgu.ac.il/rw.

Supplementary information: Supplementary data are available at Bioinformatics online.
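The idea of down-weighting redundant entries instead of discarding them can be illustrated with a minimal sketch. It is not the authors' weighting scheme, which this record does not specify. It assumes a simple rule in which every entry gets weight 1 / (size of its redundancy cluster), so that each cluster contributes the same total weight to a training loss. The function names, the cluster labels and the toy data are all hypothetical.

```python
from collections import Counter
import math


def redundancy_weights(cluster_ids):
    """Assumed rule: weight = 1 / cluster size, so each cluster sums to 1."""
    sizes = Counter(cluster_ids)
    return [1.0 / sizes[c] for c in cluster_ids]


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of the negative log-probability of the true class."""
    total = sum(w * -math.log(p[y]) for p, y, w in zip(probs, labels, weights))
    return total / sum(weights)


# Hypothetical toy set: three near-identical entries in cluster "A", one in "B".
cluster_ids = ["A", "A", "A", "B"]
weights = redundancy_weights(cluster_ids)

# Hypothetical predicted class probabilities and true labels for the four entries.
probs = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
labels = [0, 0, 0, 1]
loss = weighted_cross_entropy(probs, labels, weights)
```

Under this assumed rule, the three entries of cluster "A" together carry the same weight as the single entry of cluster "B", whereas an NR subset would keep one representative of "A" and drop the other two.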

Funder

Israel Science Foundation

ISF

Publisher

Oxford University Press (OUP)

Subject

Computational Mathematics, Computational Theory and Mathematics, Computer Science Applications, Molecular Biology, Biochemistry, Statistics and Probability

Link

http://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btaa196/33294374/btaa196.pdf

References: 39 articles.

1. TensorFlow: large-scale machine learning on heterogeneous distributed systems;Abadi,2016

2. The Protein Data Bank, 1999–

3. The Protein Data Bank: a computer-based archival file for macromolecular structures;Bernstein;J. Mol. Biol,1977

4. BLAST+: architecture and applications;Camacho;BMC Bioinformatics,2009

5. Improved residue contact prediction using support vector machines and a large feature set;Cheng;BMC Bioinformatics,2007

Cited by 7 articles.

1. TopEC: Improved classification of enzyme function by a localized 3D protein descriptor and 3D Graph Neural Networks;2024-02-02

2. Estimation of model accuracy by a unique set of features and tree-based regressor;Scientific Reports;2022-08-18

3. Flooding Prognostic in Packed Columns Based on Electrical Capacitance Tomography and Convolution Neural Network;IEEE Transactions on Instrumentation and Measurement;2022

4. Deep learning for protein secondary structure prediction: Pre and post-AlphaFold;Computational and Structural Biotechnology Journal;2022

5. ACHP: A Web Server for Predicting Anti-Cancer Peptide and Anti-Hypertensive Peptide;International Journal of Peptide Research and Therapeutics;2021-05-17