PTGAC Model: A machine learning approach for constructing phylogenetic tree to compare protein sequences

Author:

Pal Jayanta12ORCID,Saha Sourav2,Maji Bansibadan1,Bhattacharya Dilip Kumar3

Affiliation:

1. Department of ECE, National Institute of Technology, Durgapur, West Bengal 713209, India

2. Department of CSE, Narula Institute of Technology, Kolkata, West Bengal 700109, India

3. Department of Pure Mathematics, Calcutta University, Kolkata, India

Abstract

This work proposes a machine learning-based phylogenetic tree generation model based on agglomerative clustering (PTGAC) that compares protein sequences considering all known chemical properties of amino acids. The proposed model can serve as a suitable alternative to the Unweighted Pair Group Method with Arithmetic Mean (UPGMA), which is inherently time-consuming in nature. Initially, principal component analysis (PCA) is used in the proposed scheme to reduce the dimensions of 20 amino acids using seven known chemical characteristics, yielding 20 TP (Total Points) values for each amino acid. The approach of cumulative summing is then used to give a non-degenerate numeric representation of the sequences based on these 20 TP values. A special kind of three-component vector is proposed as a descriptor, which consists of a new type of non-central moment of orders one, two, and three. Subsequently, the proposed model uses Euclidean Distance measures among the descriptors to create a distance matrix. Finally, a phylogenetic tree is constructed using hierarchical agglomerative clustering based on the distance matrix. The results are compared with the UPGMA and other existing methods in terms of the quality and time of constructing the phylogenetic tree. Both qualitative and quantitative analysis are performed as key assessment criteria for analyzing the performance of the proposed model. The qualitative analysis of the phylogenetic tree is performed by considering rationalized perception, while the quantitative analysis is performed based on symmetric distance (SD). On both criteria, the results obtained by the proposed model are more satisfactory than those produced earlier on the same species by other methods. Notably, this method is found to be efficient in terms of both time and space requirements and is capable of dealing with protein sequences of varying lengths.

Publisher

World Scientific Pub Co Pte Ltd

Subject

Computer Science Applications,Molecular Biology,Biochemistry

同舟云学术

1.学者识别学者识别

2.学术分析学术分析

3.人才评估人才评估

"同舟云学术"是以全球学者为主线,采集、加工和组织学术论文而形成的新型学术文献查询和分析系统,可以对全球学者进行文献检索和人才价值评估。用户可以通过关注某些学科领域的顶尖人物而持续追踪该领域的学科进展和研究前沿。经过近期的数据扩容,当前同舟云学术共收录了国内外主流学术期刊6万余种,收集的期刊论文及会议论文总量共计约1.5亿篇,并以每天添加12000余篇中外论文的速度递增。我们也可以为用户提供个性化、定制化的学者数据。欢迎来电咨询!咨询电话:010-8811{复制后删除}0370

www.globalauthorid.com

TOP

Copyright © 2019-2024 北京同舟云网络信息技术有限公司
京公网安备11010802033243号  京ICP备18003416号-3