New Method for Sequence Similarity Analysis Based on the Position and Frequency of Statistically Significant Repeats
Published:2021-12-02
Issue:10
Volume:16
Page:1299-1310
ISSN:1574-8936
Container-title:Current Bioinformatics
language:en
Short-container-title:CBIO
Author:
Jovanovic Jasmina T. (1)
Affiliation:
1. Faculty of Mathematics, University of Belgrade, Studentski trg 16, 11000 Belgrade, Serbia
Abstract
Background:
The analysis of DNA nucleotide sequence similarity among different species is
crucial in identifying their functional, structural, or evolutionary relationships. The number of bioinformatics
tools designed to perform similarity analysis of nucleotide sequences has been growing rapidly.
According to the current literature, alignment-free methods have not been applied to repetitive
nucleotide sequences of different lengths.
Objective:
To develop a new algorithm for determining sequence characteristics and similarity based on
statistically significant repetitive elements of different lengths located in the analyzed sequences.
Methods:
This paper presents the Repeats-Position/Frequency method (R-P/F method) for determining nucleotide
sequence similarity, which takes into consideration the statistically significant repetitive parts of the analyzed
sequences. It is based on information theory and on the fact that the positions and frequencies of repeated
sequences are not expected to be identical in a random sequence of the
same length. Nucleotide sequences are represented in an rn-dimensional vector space, and their hierarchy is
constructed by applying a hierarchical clustering algorithm.
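The general pipeline described above (repeats, then position/frequency vectors, then hierarchical clustering) can be illustrated with a minimal Python sketch. It is not the R-P/F method itself, and every detail in it is an assumption made for illustration. A "repeat" is any k-mer that occurs at least twice, which stands in for the paper's statistical significance test. Each repeat contributes two coordinates, its relative frequency and its mean relative position. Sequences are compared by Euclidean distance and grouped with naive average-linkage clustering. The names `repeat_features`, `distance`, and `cluster`, and the parameters `k_min` and `k_max`, are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def repeat_features(seq, k_min=2, k_max=4):
    """Map each repeated k-mer to (relative frequency, mean relative position)."""
    positions = defaultdict(list)
    for k in range(k_min, k_max + 1):
        for i in range(len(seq) - k + 1):
            positions[seq[i:i + k]].append(i)
    features = {}
    for kmer, pos in positions.items():
        # Assumption: "occurs at least twice" is only a placeholder for a
        # proper statistical significance test.
        if len(pos) >= 2:
            windows = len(seq) - len(kmer) + 1
            frequency = len(pos) / windows
            mean_position = sum(pos) / len(pos) / len(seq)
            features[kmer] = (frequency, mean_position)
    return features


def distance(f1, f2):
    """Euclidean distance over the union of repeats; a missing repeat counts as (0, 0)."""
    total = 0.0
    for kmer in set(f1) | set(f2):
        a = f1.get(kmer, (0.0, 0.0))
        b = f2.get(kmer, (0.0, 0.0))
        total += (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    return sqrt(total)


def cluster(seqs):
    """Naive average-linkage clustering; returns the hierarchy as nested tuples."""
    feats = {name: repeat_features(s) for name, s in seqs.items()}

    def average_distance(pair):
        (_, leaves_a), (_, leaves_b) = pair
        total = sum(distance(feats[x], feats[y]) for x in leaves_a for y in leaves_b)
        return total / (len(leaves_a) * len(leaves_b))

    # Each cluster is (tree, list of leaf names); start with one cluster per sequence.
    clusters = [(name, [name]) for name in seqs]
    while len(clusters) > 1:
        left, right = min(combinations(clusters, 2), key=average_distance)
        clusters = [c for c in clusters if c is not left and c is not right]
        clusters.append(((left[0], right[0]), left[1] + right[1]))
    return clusters[0][0]


# Toy input: A and B share almost all repeats, C has a different repeat content.
example = {
    "A": "ACGTACGTACGTACGT",
    "B": "ACGTACGTACGTACGA",
    "C": "TTTTTGGGGGTTTTTGGGGG",
}
print(cluster(example))
```

In this toy input, A and B share nearly all of their repeats, so they are merged before C joins the hierarchy. A realistic implementation would replace the placeholder repeat filter with a significance test against random sequences of the same length, and would likely use an established clustering library instead of the quadratic loop shown here.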
Results:
The R-P/F method has been validated on multiple data sets of nucleotide sequences and compared
with results obtained from the alignment-based algorithms BLAST and Clustal Omega, and from multiple well-established
alignment-free dissimilarity measures. The presented method provides results comparable with those of
other commonly used methods that address the same problem, while offering a novel view of the
repetitive parts of sequences used in these calculations.
Conclusion:
The presented novel algorithm for calculating a sequence similarity measure is effective in
discovering relationships among sequences and is a powerful, complementary addition to
existing sequence similarity methods.
Publisher
Bentham Science Publishers Ltd.
Subject
Computational Mathematics, Genetics, Molecular Biology, Biochemistry