Qluster: An easy-to-implement generic workflow for robust clustering of health data

Published:2023-02-06 Volume:5
ISSN:2624-8212
Container-title:Frontiers in Artificial Intelligence
Short-container-title:Front. Artif. Intell.

Author:

Esnault Cyril, Rollot Melissa, Guilmin Pauline, Zucker Jean-Daniel

Abstract

The exploration of health data by clustering algorithms makes it possible to better describe populations of interest by identifying the sub-profiles that compose them. This in turn reinforces medical knowledge, whether about a disease or about a targeted population in real life. Nevertheless, in contrast to conventional biostatistical methods, for which numerous guidelines exist, the standardization of data science approaches in clinical research remains little discussed. This results in significant variability in the execution of data science projects, in terms of the algorithms used as well as the reliability and credibility of the designed approach. Taking a parsimonious and judicious approach to the choice of both algorithms and implementations at each stage, this article proposes Qluster, a practical workflow for performing clustering tasks. This workflow strikes a compromise between (1) genericity of applications (e.g., usable on small or big data, on continuous, categorical, or mixed variables, and on databases of high dimensionality or not), (2) ease of implementation (need for few packages, few algorithms, few parameters, ...), and (3) robustness (e.g., use of proven algorithms and robust packages, evaluation of the stability of clusters, management of noise and multicollinearity). This workflow can be easily automated and/or routinely applied to a wide range of clustering projects. It can be useful both for data scientists with little experience in the field, by making data clustering easier and more robust, and for more experienced data scientists who are looking for a straightforward and reliable solution to routinely perform preliminary data mining. A synthesis of the literature on data clustering, as well as the scientific rationale supporting the proposed workflow, is also provided. Finally, a detailed application of the workflow to a concrete use case is provided, along with a practical discussion for data scientists.
An implementation on the Dataiku platform is available upon request from the authors.

Publisher

Frontiers Media SA

Subject

Artificial Intelligence

Reference 123 articles.

1. Evaluation of Clusterings -- Metrics and Visual Support

2. Survey of state-of-the-art mixed data clustering algorithms;Ahmad;IEEE Access,2019

3. Clustering with deep learning: taxonomy and new methods;Aljalbout;arXiv:1801.07648,2018

4. Clustering ensemble method;Alqurashi;Int. J. Mach. Learn. Cyber,2019

5. Clustering;Altman;Nat. Methods,2017

Cited by 4 articles.

1. A reference architecture for personal health data spaces using decentralized content-addressable storage networks;Frontiers in Medicine;2024-07-16

2. Onset of a conceptual outline map to get a hold on the jungle of cluster analysis;WIREs Data Mining and Knowledge Discovery;2024-07-11

3. Protocol for the development of a tool to map systemic sclerosis pain sources, patterns, and management experiences: a Scleroderma Patient-centered Intervention Network patient-researcher partnership;BMC Rheumatology;2024-06-21

4. Optimizing data regeneration and storage with data dependency for cloud scientific workflow systems;Expert Systems with Applications;2024-03