Euclidean distance-optimized data transformation for cluster analysis in biomedical data (EDOtrans)

Published: 2022-06-16 Issue: 1 Volume: 23
ISSN:1471-2105
Container-title:BMC Bioinformatics
language:en
Short-container-title:BMC Bioinformatics

Author:

Ultsch Alfred, Lötsch Jörn

Abstract

Background: Data transformations are commonly used in bioinformatics data processing in the context of data projection and clustering. The most used Euclidean metric is not scale invariant and therefore occasionally inappropriate for complex, e.g., multimodal distributed variables and may negatively affect the results of cluster analysis. Specifically, the squaring function in the definition of the Euclidean distance as the square root of the sum of squared differences between data points has the consequence that the value 1 implicitly defines a limit for distances within clusters versus distances between (inter-) clusters.

Methods: The Euclidean distances within a standard normal distribution (N(0,1)) follow a N(0, $$\sqrt{2}$$) distribution. The EDO-transformation of a variable X is proposed as

$$EDO = X/(\sqrt{2}\cdot s)$$

following modeling of the standard deviation s by a mixture of Gaussians and selecting the dominant modes via item categorization. The method was compared in artificial and biomedical datasets with clustering of untransformed data, z-transformed data, and the recently proposed pooled variable scaling.

Results: A simulation study and applications to known real data examples showed that the proposed EDO scaling method is generally useful. The clustering results in terms of cluster accuracy, adjusted Rand index and Dunn’s index outperformed the classical alternatives. Finally, the EDO transformation was applied to cluster a high-dimensional genomic dataset consisting of gene expression data for multiple samples of breast cancer tissues, and the proposed approach gave better results than classical methods and was compared with pooled variable scaling.

Conclusions: For multivariate procedures of data analysis, it is proposed to use the EDO transformation as a better alternative to the established z-standardization, especially for nontrivially distributed data. The “EDOtrans” R package is available at https://cran.r-project.org/package=EDOtrans.
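
The scaling described in the Methods can be illustrated with a minimal Python sketch. This is not the EDOtrans R package and it does not reproduce the Gaussian mixture modeling or the item categorization step. The function name `edo_scale` and its parameter `s` are hypothetical; as a simplifying assumption, s is either passed in by the caller or replaced by the plain sample standard deviation of the whole variable. The simulation checks only the statement that differences between values drawn from N(0,1) have a standard deviation close to the square root of 2.

```python
import math
import random
import statistics


def edo_scale(values, s=None):
    """Divide each value by sqrt(2) * s, following EDO = X / (sqrt(2) * s).

    Assumption: if s is not given, the plain sample standard deviation of
    `values` is used. The paper instead derives s from the dominant modes
    of a Gaussian mixture, which is not implemented here.
    """
    if s is None:
        s = statistics.stdev(values)
    if s <= 0:
        raise ValueError("s must be positive")
    factor = math.sqrt(2.0) * s
    return [x / factor for x in values]


# Simulated data: two independent samples from N(0, 1).
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(20000)]
y = [rng.gauss(0.0, 1.0) for _ in range(20000)]

# Pairwise differences of N(0, 1) values: standard deviation near sqrt(2).
diff_sd = statistics.stdev([a - b for a, b in zip(x, y)])

# A variable with standard deviation 5, scaled with the known s = 5.
raw = [5.0 * v + 3.0 for v in x]
scaled = edo_scale(raw, s=5.0)
scaled_sd = statistics.stdev(scaled)  # near 1 / sqrt(2)

print(f"sd of differences: {diff_sd:.3f}")
print(f"sd after EDO scaling: {scaled_sd:.3f}")
```

After scaling, the variable has a standard deviation near 1/sqrt(2), so differences between two values from the same group have a standard deviation near 1, which matches the role of the value 1 as the implicit limit described in the Background.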

Funder

Deutsche Forschungsgemeinschaft

Landesoffensive zur Entwicklung wissenschaftlich-ökonomischer Exzellenz

Johann Wolfgang Goethe-Universität, Frankfurt am Main

Publisher

Springer Science and Business Media LLC

Subject

Applied Mathematics,Computer Science Applications,Molecular Biology,Biochemistry,Structural Biology

Link

https://link.springer.com/content/pdf/10.1186/s12859-022-04769-w.pdf

Reference: 39 articles.

1. Lötsch J, Ultsch A. Current projection methods-induced biases at subgroup detection for machine-learning based data-analysis of biomedical data. Int J Mol Sci. 2019;21(1):79.

2. Ultsch A, Lötsch J. Machine-learned cluster identification in high-dimensional data. J Biomed Inform. 2017;66:95–104.

3. Hair JF. Multivariate data analysis. Boston: Cengage; 2019.

4. Kim T, Chen IR, Lin Y, Wang AY, Yang JYH, Yang P. Impact of similarity metrics on single-cell RNA-seq data clustering. Brief Bioinform. 2019;20(6):2316–26.

5. Hurewicz W, James H, Nichols N. Filters and servo systems with pulsed data. In: James HM, Nichols NB, Phillips RS, editors. Theory of servomechanisms, vol. 25. New York: McGraw-Hill; 1947.

Cited by 17 articles.

1. Multi-person cooperative positioning of forest rescuers based on inertial navigation system (INS), global navigation satellite system (GNSS), and ZigBee;Measurement;2025-02

2. Water Supply Pipeline Operation Anomaly Mining and Spatiotemporal Correlation Study;Journal of Pipeline Systems Engineering and Practice;2024-11

3. A cell and transcriptome atlas of the human arterial vasculature;2024-09-10

4. A Spatial–Seasonal Study on the Danube River in the Adjacent Danube Delta Area: Case Study—Monitored Heavy Metals;Water;2024-09-02

5. Performance evaluation of attention-deep hashing based medical image retrieval in brain MRI datasets;Journal of Radiation Research and Applied Sciences;2024-09