Identification of Disease-specific Single Amino Acid Polymorphisms Using a Simple Random Forest at Protein-level
Published:2021-12-02
Issue:10
Volume:16
Page:1278-1287
ISSN:1574-8936
Container-title:Current Bioinformatics
language:en
Short-container-title:CBIO
Author:
He Jian1,
Yuan Rongao2,
Xu Lei1,
Guo Yanzhi1,
Li Menglong1
Affiliation:
1. College of Chemistry, Sichuan University, Chengdu, China
2. College of Computer Science, Sichuan University, Chengdu, China
Abstract
Background:
The number of human genetic variants deposited in publicly available databases
has been increasing exponentially. Among these variants, non-synonymous single nucleotide
polymorphisms (nsSNPs), also known as Single Amino Acid Polymorphisms (SAPs), have been
shown to correlate strongly with phenotypic variation in traits and diseases.
Objective:
The detailed mechanisms governing the disease association of SAPs remain unclear.
Because a large number of SAPs with unknown disease association still need to be investigated,
exploring new attributes and improving prediction performance is increasingly urgent.
Methods:
Using Random Forest (RF), we first constructed a new prediction model that identifies SAPs
associated with a particular disease from protein sequences. Four commonly used sequence
signature extraction methods were evaluated separately to select the optimal features. SAP peptide
lengths from 12 to 202 were then also optimized.
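
The sequence-only feature step described in the Methods can be illustrated with a minimal sketch. The abstract does not name the four signature extraction methods or define how peptide length is counted, so everything below is an assumption made for illustration: amino acid composition stands in for one possible signature, `flank` is a hypothetical parameter giving the number of residues kept on each side of the variant, and `X` is a hypothetical padding symbol for windows that run past a terminus. This is not the authors' implementation.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD = "X"  # hypothetical padding symbol (assumption, not from the paper)


def extract_window(sequence, position, variant, flank):
    """Return the variant residue with `flank` residues on each side.

    `position` is 1-based. The wild-type residue at that position is
    replaced by `variant`; windows that extend past either terminus
    are padded with PAD so every peptide has length 2 * flank + 1.
    """
    if not 1 <= position <= len(sequence):
        raise ValueError("position lies outside the sequence")
    idx = position - 1
    mutated = sequence[:idx] + variant + sequence[idx + 1:]
    left = mutated[max(0, idx - flank):idx]
    right = mutated[idx + 1:idx + 1 + flank]
    return (PAD * (flank - len(left)) + left + variant
            + right + PAD * (flank - len(right)))


def aa_composition(peptide):
    """Return a 20-dimensional vector of residue fractions.

    Padding and non-standard symbols are ignored when normalising.
    """
    counts = Counter(peptide)
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


peptide = extract_window("MKTAYIAKQR", position=5, variant="F", flank=3)
features = aa_composition(peptide)
```

Feature vectors of this kind, computed for disease-associated and other SAPs, could then be passed to any Random Forest implementation. Repeating the procedure over a range of `flank` values is one way the peptide length could be tuned.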
Results:
The optimal models achieve an accuracy higher than 90% and an Area Under the Curve (AUC)
above 0.9 on all 11 external testing datasets. Performance on an independent test set,
with an accuracy higher than 95%, further demonstrates the strength of our method.
Conclusion:
In this paper, we constructed 11 Random Forest (RF) based disease-association prediction
models for SAPs at the protein sequence level. All models yield a prediction accuracy higher
than 90% and an Area Under the Curve (AUC) above 0.9. Because our method uses only
protein sequence information, it is more universal than methods that depend on additional
information or predictions about the proteins.
Publisher
Bentham Science Publishers Ltd.
Subject
Computational Mathematics,Genetics,Molecular Biology,Biochemistry