Reproducible Clusters from Microarray Research: Whither?

Published:2005-07 Issue:S2 Volume:6 Page:
ISSN:1471-2105
Container-title:BMC Bioinformatics
language:en
Short-container-title:BMC Bioinformatics

Author:

Garge Nikhil R, Page Grier P, Sprague Alan P, Gorman Bernard S, Allison David B

Abstract

Motivation: In cluster analysis, the validity of specific solutions, algorithms, and procedures presents significant challenges because there is no null hypothesis to test and no 'right answer'. It has been noted that a replicable classification is not necessarily a useful one, but a useful one that characterizes some aspect of the population must be replicable. By replicable we mean reproducible across multiple samplings from the same population. Methodologists have suggested that the validity of clustering methods should be based on classifications that yield reproducible findings beyond chance levels. We used this approach to determine the performance of commonly used clustering algorithms and the degree of replicability achieved using several microarray datasets.

Methods: We considered four commonly used iterative partitioning algorithms (Self Organizing Maps (SOM), K-means, Clustering LARge Applications (CLARA), and Fuzzy C-means) and evaluated their performance on 37 microarray datasets, with sample sizes ranging from 12 to 172. We assessed the reproducibility of each clustering algorithm by measuring the strength of the relationship between the clustering outputs of subsamples of the 37 datasets. Cluster stability was quantified using Cramér's V² from a k × k table. Cramér's V² is equivalent to the squared canonical correlation coefficient between two sets of nominal variables. Potential scores range from 0 to 1, with 1 denoting perfect reproducibility.

Results: All four clustering routines showed increased stability with larger sample sizes. K-means and SOM showed a gradual increase in stability with increasing sample size. CLARA and Fuzzy C-means, however, yielded low stability scores until sample sizes approached 30, and increased gradually thereafter. Average stability never exceeded 0.55 for the four clustering routines, even at a sample size of 50.

Conclusion: These findings suggest several plausible scenarios: (1) microarray datasets lack natural clustering structure, thereby producing low stability scores with all four methods; (2) the algorithms studied do not produce reliable results; and/or (3) the sample sizes typically used in microarray research may be too small to support the derivation of reliable clustering results. Further research should be directed towards evaluating the stability of more clustering algorithms on more datasets, especially datasets with larger sample sizes, and with larger numbers of clusters considered.
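
To make the stability score concrete, here is a minimal sketch of Cramér's V² for two clusterings of the same items. It assumes the standard definition V² = χ² / (n · min(r − 1, c − 1)), where χ² is the uncorrected Pearson chi-square statistic of the r × c contingency table and n is the number of items. The function name `cramers_v_squared` and the toy labels are illustrative assumptions; this is not the authors' code, and the abstract does not describe their exact subsampling procedure.

```python
import numpy as np


def cramers_v_squared(labels_a, labels_b):
    """Cramér's V^2 between two cluster labelings of the same items.

    Illustrative helper (hypothetical name), assuming the standard
    definition V^2 = chi2 / (n * min(r - 1, c - 1)).
    """
    labels_a = np.asarray(labels_a).ravel()
    labels_b = np.asarray(labels_b).ravel()
    if labels_a.shape != labels_b.shape:
        raise ValueError("both labelings must cover the same items")

    # Re-code arbitrary labels as 0..r-1 and 0..c-1.
    _, a_idx = np.unique(labels_a, return_inverse=True)
    _, b_idx = np.unique(labels_b, return_inverse=True)
    r = int(a_idx.max()) + 1
    c = int(b_idx.max()) + 1
    if min(r, c) < 2:
        raise ValueError("each labeling needs at least two clusters")

    # Contingency table: cell (i, j) counts items placed in cluster i
    # by the first labeling and in cluster j by the second.
    table = np.zeros((r, c))
    np.add.at(table, (a_idx, b_idx), 1)

    n = table.sum()
    # Expected counts under independence of the two labelings.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    return float(chi2 / (n * (min(r, c) - 1)))


if __name__ == "__main__":
    # Same partition with permuted cluster names: perfectly reproducible.
    same = cramers_v_squared([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1])
    # Labelings that carry no information about each other.
    unrelated = cramers_v_squared([0, 0, 1, 1], [0, 1, 0, 1])
    print(same, unrelated)
```

Because the statistic is computed from the contingency table, it does not depend on how the clusters are numbered, which matters when comparing outputs from separate runs of an algorithm.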

Publisher

Springer Science and Business Media LLC

Subject

Applied Mathematics,Computer Science Applications,Molecular Biology,Biochemistry,Structural Biology

Link

http://link.springer.com/content/pdf/10.1186/1471-2105-6-S2-S10.pdf

References: 34 articles.

1. Bryan J: Problems in gene clustering based on gene expression data. Journal of Multivariate Analysis 2004, 90: 44–66.

2. Mehta T, Tanik M, Allison DB: Towards sound epistemological foundations of statistical methods for high-dimensional biology. Nature Genetics 2004, 36: 943–7.

3. McShane LM, Radmacher MD, Freidlin B, Yu R, Li MC, Simon R: Methods of assessing reproducibility of clustering patterns observed in analysis of microarray data. Bioinformatics 2002, 18: 1462–1469.

4. Roth V, Braun ML, Lange T, Buhmann JM: Stability-based model order selection in clustering with applications to gene expression data. Lecture Notes in Computer Science 2002, 2415: 607–612.

5. Blashfield RK, Aldenderfer MS: The Methods and Problems of Cluster Analysis. In Handbook of Multivariate Experimental Psychology. 2nd edition. Edited by: Nesselroade JR, Cattell RB. New York: Plenum; 1988:447–473.

Cited by 44 articles.

1. On data normalization and batch-effect correction for tumor subtyping with microRNA data;NAR Genomics and Bioinformatics;2023-01-10

2. The impact of the COVID-19 pandemic on O-D flow and airport networks in the origin country and in Northeast Asia;Journal of Air Transport Management;2022-05

3. Using Nonlinear Dynamics and Multivariate Statistics to Analyze EEG Signals of Insomniacs with the Intervention of Superficial Acupuncture;Evidence-Based Complementary and Alternative Medicine;2020-11-17

4. Challenges of Clustering Multimodal Clinical Data: Review of Applications in Asthma Subtyping;JMIR Medical Informatics;2020-05-28

5. Challenges of Clustering Multimodal Clinical Data: Review of Applications in Asthma Subtyping (Preprint);2019-09-30