Protein Sequence Comparison and DNA-binding Protein Identification with Generalized PseAAC and Graphical Representation

Published: 2018-04-17 Issue: 2 Volume: 21 Page: 100-110
ISSN: 1386-2073
Container-title: Combinatorial Chemistry & High Throughput Screening
Language: en
Short-container-title: CCHTS

Author:

Li Chun¹, Zhao Jialing², Wang Changzhong², Yao Yuhua¹

Affiliation:

1. School of Mathematics and Statistics, Hainan Normal University, Haikou 571158, China

2. Department of Mathematics, Bohai University, Jinzhou 121013, China

Abstract

Aim and Objective: The rapid increase in the amount of available protein sequence data creates an urgent need for novel computational algorithms to analyze and compare these sequences. This study was undertaken to develop an efficient computational approach for encoding protein sequences in a timely manner and extracting the information hidden in them.

Methods: Based on two physicochemical properties of amino acids, a protein primary sequence was converted into a three-letter sequence, from which a graph without loops or multiple edges and its geometric line adjacency matrix were obtained. A generalized PseAAC (pseudo amino acid composition) model was then constructed to characterize a protein sequence numerically.

Results: Using the proposed mathematical descriptor of a protein sequence, similarity comparisons were made among the β-globin proteins of 17 species and among 72 spike proteins of coronaviruses. The resulting clusters agreed well with the established taxonomic groups. In addition, a generalized PseAAC-based SVM (support vector machine) model was developed to identify DNA-binding proteins. Experimental results showed that the method performed better than DNAbinder, DNA-Prot, iDNA-Prot and enDNA-Prot by 3.29-10.44% in terms of ACC, 0.056-0.206 in terms of MCC, and 1.45-15.76% in terms of F1M. When the benchmark dataset was expanded with negative samples, the presented approach outperformed the four previous methods, with improvements in the range of 2.49-19.12% in terms of ACC, 0.05-0.32 in terms of MCC, and 3.82-33.85% in terms of F1M.

Conclusion: These results suggest that the generalized PseAAC model is very efficient for the comparison and analysis of protein sequences, and very competitive in identifying DNA-binding proteins.
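For readers unfamiliar with the evaluation measures named in the Results, the sketch below shows how ACC, MCC and an F1 score are conventionally computed from the confusion-matrix counts of a binary classifier. It assumes that ACC is accuracy, MCC is the Matthews correlation coefficient, and F1M is the standard F1-measure; the abstract does not spell these out. The function name `binary_metrics` and the counts in the usage line are illustrative only and are not taken from the paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return (ACC, MCC, F1) from binary confusion-matrix counts.

    Assumption: these are the textbook definitions; the abstract does not
    state the exact formulas the authors used.
    """
    total = tp + fp + tn + fn
    # Accuracy: fraction of all samples classified correctly.
    acc = (tp + tn) / total

    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0

    # F1: harmonic mean of precision and recall, written in count form.
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0
    return acc, mcc, f1


# Hypothetical counts, for illustration only (not data from the paper).
acc, mcc, f1 = binary_metrics(tp=40, fp=10, tn=40, fn=10)
print(f"ACC={acc:.3f}  MCC={mcc:.3f}  F1={f1:.3f}")
```

ACC and F1 lie in [0, 1] and are often quoted as percentages, whereas MCC lies in [-1, 1], which is consistent with the abstract reporting ACC and F1M gains in percent and MCC gains as plain decimals.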

Publisher

Bentham Science Publishers Ltd.

Subject

Organic Chemistry, Computer Science Applications, Drug Discovery, General Medicine

Cited by 9 articles.

1. Singular value thresholding two-stage matrix completion for drug sensitivity discovery;Computational Biology and Chemistry;2024-06

2. Protein sequence comparison based on representation on a finite dimensional unit hypercube;Journal of Biomolecular Structure and Dynamics;2023-10-14

3. Machine learning in genomics: identification and modeling of anticancer peptides;Data Science for Genomics;2023

4. FFP: joint Fast Fourier transform and fractal dimension in amino acid property-aware phylogenetic analysis;BMC Bioinformatics;2022-08-19

5. Phylogenetic Analysis: A Novel Method of Protein Sequence Similarity Analysis;International Journal of Pattern Recognition and Artificial Intelligence;2022-04-25