Author:
Chen Jiaxing, Ng Yen Kaow, Lin Lu, Zhang Xianglilan, Li Shuaicheng
Abstract
Background
Distance functions are fundamental for evaluating the differences between gene expression profiles. Such a function should output a low value if the profiles are strongly correlated, whether negatively or positively, and a high value otherwise. One popular distance function is the absolute correlation distance, $$d_a=1-|\rho |$$, where $$\rho$$ is a similarity measure such as the Pearson or Spearman correlation. However, the absolute correlation distance fails to fulfill the triangle inequality, a property that would guarantee better performance in vector quantization, allow fast data localization, and accelerate data clustering.
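The failure of the triangle inequality can be seen on a small hand-built example. The sketch below is illustrative only and is not taken from the paper: the three toy profiles `x`, `y`, `z` and the helper names `pearson` and `d_a` are assumptions chosen so that `x` and `z` are uncorrelated while `y` correlates equally with both.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)

def d_a(a, b):
    """Absolute correlation distance, d_a = 1 - |rho|."""
    return 1.0 - abs(pearson(a, b))

# Hypothetical toy profiles: x and z are uncorrelated, y = x + z.
x = [1.0, -1.0, 0.0, 0.0]
z = [0.0, 0.0, 1.0, -1.0]
y = [1.0, -1.0, 1.0, -1.0]

direct = d_a(x, z)              # distance from x straight to z
detour = d_a(x, y) + d_a(y, z)  # distance from x to z via y

# The triangle inequality would require direct <= detour.
violates_triangle = direct > detour
print(direct, detour, violates_triangle)
```

Here the direct distance between `x` and `z` is larger than the detour through `y`, so $$d_a$$ is not a metric on these profiles.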
Results
In this work, we propose $$d_r=\sqrt{1-|\rho |}$$ as an alternative. We prove that $$d_r$$ satisfies the triangle inequality when $$\rho$$ represents Pearson correlation, Spearman correlation, or cosine similarity. We show, both analytically and experimentally, that $$d_r$$ is better than $$d_s=\sqrt{1-\rho ^2}$$, another variant of $$d_a$$ that satisfies the triangle inequality. We empirically compared $$d_r$$ with $$d_a$$ in gene clustering and sample clustering experiments on real-world biological data. The two distances performed similarly in both gene clustering and sample clustering under hierarchical clustering and PAM (partitioning around medoids) clustering. However, $$d_r$$ demonstrated more robust clustering. In the bootstrap experiment, $$d_r$$ generated more robust sample pair partitions more frequently (P-value $$<0.05$$). The statistics on the number of times a class “dissolved” also support the advantage of $$d_r$$ in robustness.
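As a numerical sanity check of the triangle inequality for $$d_r$$ under Pearson correlation, the following sketch tests every ordered triple drawn from a small set of random profiles. It is an illustration under stated assumptions, not the authors' experiment: the random data, the fixed seed, the profile sizes, and the helper names `pearson` and `d_r` are all hypothetical.

```python
import itertools
import math
import random

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)

def d_r(a, b):
    """Proposed distance, d_r = sqrt(1 - |rho|)."""
    # max(...) guards against a tiny negative argument caused by rounding.
    return math.sqrt(max(0.0, 1.0 - abs(pearson(a, b))))

# Hypothetical data: 25 random "profiles" with 8 measurements each.
rng = random.Random(0)  # fixed seed so the check is repeatable
profiles = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(25)]

# Smallest value of d_r(x, y) + d_r(y, z) - d_r(x, z) over all ordered triples.
# A non-negative minimum means no triple violates the triangle inequality.
worst_slack = min(
    d_r(x, y) + d_r(y, z) - d_r(x, z)
    for x, y, z in itertools.permutations(profiles, 3)
)
print(worst_slack)
```

A check like this can only fail to find a counterexample; it does not replace the proof.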
Conclusion
$$d_r$$, as a variant of the absolute correlation distance, satisfies the triangle inequality and is capable of more robust clustering.
Funder
CityU/UGC Research Matching Grant Scheme
National Key R&D Program of China Grants
Publisher
Springer Science and Business Media LLC
Subject
Applied Mathematics, Computer Science Applications, Molecular Biology, Biochemistry, Structural Biology
Cited by
6 articles.