PTGAC Model: A machine learning approach for constructing phylogenetic tree to compare protein sequences

Published:2023-02 Issue:01 Volume:21
ISSN:0219-7200
Container-title:Journal of Bioinformatics and Computational Biology
language:en
Short-container-title:J. Bioinform. Comput. Biol.

Author:

Pal Jayanta¹², Saha Sourav², Maji Bansibadan¹, Bhattacharya Dilip Kumar³

Affiliation:

1. Department of ECE, National Institute of Technology, Durgapur, West Bengal 713209, India

2. Department of CSE, Narula Institute of Technology, Kolkata, West Bengal 700109, India

3. Department of Pure Mathematics, Calcutta University, Kolkata, India

Abstract

This work proposes a machine learning-based phylogenetic tree generation model based on agglomerative clustering (PTGAC) that compares protein sequences while considering all known chemical properties of amino acids. The proposed model can serve as an alternative to the Unweighted Pair Group Method with Arithmetic Mean (UPGMA), which is inherently time-consuming. First, principal component analysis (PCA) is applied to seven known chemical characteristics of the 20 amino acids to reduce their dimensionality, yielding 20 TP (Total Points) values, one for each amino acid. Cumulative summing over these 20 TP values then gives a non-degenerate numeric representation of each sequence. A three-component vector is proposed as the descriptor; it consists of a new type of non-central moment of orders one, two, and three. The model then computes Euclidean distances between the descriptors to create a distance matrix. Finally, a phylogenetic tree is constructed from the distance matrix using hierarchical agglomerative clustering. The results are compared with those of UPGMA and other existing methods in terms of tree quality and construction time. Both qualitative and quantitative analyses are used as the key assessment criteria. The qualitative analysis of the phylogenetic tree is based on rationalized perception, while the quantitative analysis is based on symmetric distance (SD). On both criteria, the results obtained by the proposed model are more satisfactory than those produced earlier on the same species by other methods. Notably, the method is found to be efficient in both time and space requirements and can handle protein sequences of varying lengths.
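The pipeline described in the abstract (numeric encoding, cumulative sum, moment descriptor, Euclidean distance matrix, agglomerative clustering) can be illustrated with a minimal Python sketch. This is not the authors' implementation. Several pieces are assumptions made only for illustration: `PLACEHOLDER_TP` uses arbitrary values in place of the PCA-derived TP values, `descriptor` uses ordinary raw moments as a stand-in for the paper's new type of non-central moment (whose definition the abstract does not give), and `agglomerate` assumes average linkage because the abstract does not name a linkage. All function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# ASSUMPTION: arbitrary placeholder values (1..20), NOT the paper's TP values.
# In the paper, one TP value per amino acid comes from PCA over seven properties.
PLACEHOLDER_TP = {aa: float(i + 1) for i, aa in enumerate(AMINO_ACIDS)}


def cumulative_sum(seq, tp):
    """Map each residue to its TP value and return the running totals."""
    total, out = 0.0, []
    for aa in seq:
        total += tp[aa]
        out.append(total)
    return out


def descriptor(seq, tp):
    """Three-component vector of raw moments of orders 1, 2 and 3.

    ASSUMPTION: plain raw moments stand in for the paper's moment definition.
    The vector length is fixed, so sequences of different lengths are comparable.
    """
    s = cumulative_sum(seq, tp)
    return tuple(sum(x ** k for x in s) / len(s) for k in (1, 2, 3))


def distance_matrix(descs):
    """Pairwise Euclidean distances between descriptors."""
    return [[math.dist(a, b) for b in descs] for a in descs]


def agglomerate(names, dist):
    """Naive bottom-up clustering; returns the tree as nested tuples.

    ASSUMPTION: average linkage (mean pairwise distance between clusters).
    """
    clusters = [(name, [i]) for i, name in enumerate(names)]

    def link(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        p, q = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda xy: link(clusters[xy[0]][1], clusters[xy[1]][1]),
        )
        merged = ((clusters[p][0], clusters[q][0]), clusters[p][1] + clusters[q][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)] + [merged]
    return clusters[0][0]


# Toy sequences, made up for the example (not real proteins).
seqs = {"s1": "ACDEFG", "s2": "ACDEFH", "s3": "WYVTSR"}
names = list(seqs)
descs = [descriptor(seqs[n], PLACEHOLDER_TP) for n in names]
tree = agglomerate(names, distance_matrix(descs))
print(tree)
```

The nested tuples returned by `agglomerate` record the merge order, so the two toy sequences that differ in a single residue are grouped first. With real data, the placeholder mapping would be replaced by the paper's TP values, and a library routine for hierarchical clustering would normally be preferred over this quadratic-search loop.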

Publisher

World Scientific Pub Co Pte Ltd

Subject

Computer Science Applications, Molecular Biology, Biochemistry

Link

https://www.worldscientific.com/doi/pdf/10.1142/S0219720022500287
