OsamorSoft: clustering index for comparison and quality validation in high throughput dataset

Published:2020-07-09 Issue:1 Volume:7 Page:
ISSN:2196-1115
Container-title:Journal of Big Data
language:en
Short-container-title:J Big Data

Author:

Osamor Ifeoma Patricia, Osamor Victor Chukwudi

Abstract

Differences in the results produced by different k-means clustering algorithms call for a simplified approach to validating the quality of the clusters obtained. These differences arise partly from the way the algorithms select their first seed or centroid, whether randomly, sequentially, or by some other principle, which tends to influence the final result. Popular external cluster quality validation and comparison models require the computation of various clustering indexes such as Rand, Jaccard, Fowlkes and Mallows, the Morey and Agresti Adjusted Rand Index (ARIMA), and the Hubert and Arabie Adjusted Rand Index (ARIHA). In the literature, ARIHA has been adjudged a good measure of cluster validity. Based on ARIHA as a popular clustering quality index, we developed OsamorSoft, which comprises DNA_Omatrix and OsamorSpreadSheet, as a tool for cluster quality validation in high throughput analysis. The proposed method will help to bridge the gap created by the small number of user-friendly tools available to externally evaluate the ever-increasing number of clustering algorithms. Our implementation was tested on clusters created with four k-means algorithms using malaria microarray data. Furthermore, our results yielded a compact 4-stage set of OsamorSpreadSheet statistics that OsamorSoft, our easy-to-use Java GUI and spreadsheet-based tool, uses for cluster quality comparison. We recommend that a framework be developed to facilitate the simplified integration and automation of several other cluster validity indexes for comparative analysis of big data problems.
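To make the external indexes named in the abstract concrete, the following is a minimal Python sketch of the standard pair-counting definitions of the Rand, Jaccard, Fowlkes and Mallows, and Hubert and Arabie adjusted Rand (ARIHA) indexes. It is not the OsamorSoft implementation: the function name `pair_counting_indexes` and the dictionary keys are hypothetical, the Morey and Agresti variant is left out, and degenerate inputs (for example, every object in its own cluster) are not handled.

```python
from collections import Counter
from math import comb, sqrt


def pair_counting_indexes(labels_a, labels_b):
    """Compare two clusterings of the same objects (hypothetical helper).

    labels_a[i] and labels_b[i] are the cluster labels that the two
    clusterings assign to object i.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same objects")

    n = len(labels_a)
    total_pairs = comb(n, 2)

    # Contingency table: how many objects fall in cluster u of A and v of B.
    cells = Counter(zip(labels_a, labels_b))

    # Pairs placed together in both clusterings, in A only, and in B only.
    same_both = sum(comb(c, 2) for c in cells.values())
    same_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    same_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    # Rand: fraction of pairs on which the two clusterings agree.
    rand = (total_pairs + 2 * same_both - same_a - same_b) / total_pairs
    jaccard = same_both / (same_a + same_b - same_both)
    fowlkes_mallows = same_both / sqrt(same_a * same_b)

    # Hubert and Arabie correction: subtract the agreement expected by chance.
    expected = same_a * same_b / total_pairs
    ari_ha = (same_both - expected) / (0.5 * (same_a + same_b) - expected)

    return {
        "rand": rand,
        "jaccard": jaccard,
        "fowlkes_mallows": fowlkes_mallows,
        "ari_ha": ari_ha,
    }


if __name__ == "__main__":
    # Two illustrative labelings of six objects (made-up data).
    first = [0, 0, 0, 1, 1, 1]
    second = [0, 0, 1, 1, 2, 2]
    for name, value in pair_counting_indexes(first, second).items():
        print(f"{name}: {value:.4f}")
```

Because all four indexes depend only on which pairs of objects are grouped together, relabeling the clusters of either input leaves the values unchanged, which is what makes them usable for comparing k-means runs that start from different seeds.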

Funder

Covenant University

Publisher

Springer Science and Business Media LLC

Subject

Information Systems and Management,Computer Networks and Communications,Hardware and Architecture,Information Systems

Link

https://link.springer.com/content/pdf/10.1186/s40537-020-00325-6.pdf

References: 54 articles.

1. MacQueen J. Some methods for classification and analysis of multivariate observations. In: Le Cam LM, Neyman J, editors. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. 1967.

2. Gower JC, Legendre P. Metric and Euclidean properties of dissimilarity coefficients. J Classif. 1986;3(1):5–48.

3. Batagelj V, Bren M. Comparing resemblance measures. J Classif. 1995;12(1):73–90.

4. Kanungo T, Mount DM, Netanyahu NS, Piatko CD, Silverman R, Wu AY. A local search approximation algorithm for k-means clustering. Comput Geom. 2004;28(2–3):89–112.

5. Albatineh AN, Niewiadomska-Bugaj M, Mihalko D. On similarity indices and correction for chance agreement. J Classif. 2006;23(2):301–13.

Cited by 8 articles.

1. The workshops on computational applications in secondary metabolite discovery (CAiSMD);Physical Sciences Reviews;2024-05-08

2. Statistical and clustering validation analysis of primary students' learning outcomes and self-awareness of information and technical online security problems at a post-pandemic time;Education and Information Technologies;2022-11-17

3. Comparative analysis of features extraction techniques for black face age estimation;AI & SOCIETY;2022-03-25

4. Community-Acquired Pneumonia Recognition by Wavelet Entropy and Cat Swarm Optimization;Mobile Networks and Applications;2022-02-21

5. Computational Applications in Secondary Metabolite Discovery (CAiSMD): an online workshop;Journal of Cheminformatics;2021-09-06