Data-driven versus a domain-led approach to k-means clustering on an open heart failure dataset
Published: 2022-07-25
Issue: 1
Volume: 15
Pages: 49-66
ISSN: 2364-415X
Container-title: International Journal of Data Science and Analytics
Language: en
Short-container-title: Int J Data Sci Anal
Author:
Jasinska-Piadlo A., Bond R., Biglarbeigi P., Brisk R., Campbell P., Browne F., McEneaneny D.
Abstract
Domain-driven data mining of health care data poses unique challenges. The aim of this paper is to explore the advantages and challenges of a ‘domain-led approach’ versus a ‘data-driven approach’ to a k-means clustering experiment. In the ‘domain-led approach’, clinical experts in heart failure selected the variables to be used during the k-means clustering, whilst in the ‘data-driven approach’ feature selection was performed by applying principal component analysis to the multidimensional dataset. Six out of seven features selected by physicians were amongst the 26 features that contributed most to the significant principal components within the k-means algorithm. The data-driven approach showed an advantage over the domain-led approach for feature selection by removing the risk of bias that can be introduced by domain experts. Whilst the ‘domain-led approach’ may prohibit knowledge discovery that can be hidden behind variables not routinely considered clinically important, domain knowledge played an important role at the interpretation stage of the clustering experiment, providing insight into the context and preventing far-fetched conclusions. The ‘data-driven approach’ was accurate in identifying clusters with distinct features at the physiological level. To promote the domain-led data mining approach, as a result of this experiment we developed a practical checklist guiding how to integrate domain knowledge into a data mining project.
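As a rough illustration of the two feature-selection routes contrasted in the abstract, the sketch below runs k-means once on principal component scores (data-driven) and once on a hand-picked subset of columns (domain-led). It is not the authors' code. The data are synthetic, the function names (`standardise`, `pca`, `kmeans`, `top_features`), the 90% variance threshold, the number of clusters and the column indices in `domain_cols` are all assumptions made for illustration, and the k-means here is a plain Lloyd's iteration with a single random start.

```python
import numpy as np


def standardise(X):
    """Centre each column and scale it to unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


def pca(X, variance_kept=0.9):
    """PCA via SVD on already-centred data.

    Returns component scores, loadings (one row per component) and the
    explained-variance ratio of each retained component.
    """
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    ratio = S**2 / np.sum(S**2)
    # Smallest number of components whose cumulative ratio reaches the threshold.
    k = int(np.searchsorted(np.cumsum(ratio), variance_kept) + 1)
    k = min(k, len(S))
    return U[:, :k] * S[:k], Vt[:k], ratio[:k]


def kmeans(X, n_clusters, n_iter=100, seed=0):
    """Plain Lloyd's algorithm with one random initialisation."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centre, then nearest-centre labels.
        dist = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Recompute centres; an empty cluster keeps its previous centre.
        new_centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(n_clusters)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres


def top_features(loadings, n_top=3):
    """Indices of the features with the largest absolute loading per component."""
    return np.argsort(-np.abs(loadings), axis=1)[:, :n_top]


# Synthetic stand-in for a patient-by-variable table (NOT the heart failure data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))
X[:100, :3] += 3.0  # give half of the rows a shift in three variables

Z = standardise(X)

# Data-driven route: reduce with PCA, then cluster the component scores.
scores, loadings, ratio = pca(Z, variance_kept=0.9)
labels_data, _ = kmeans(scores, n_clusters=2)
top = top_features(loadings, n_top=3)

# Domain-led route: cluster only the columns an expert would pick (hypothetical choice).
domain_cols = [0, 1, 2, 5]
labels_domain, _ = kmeans(Z[:, domain_cols], n_clusters=2)
```

Comparing `top` with `domain_cols` mirrors, in miniature, the abstract's check of how many physician-selected features also rank among the strongest contributors to the retained principal components.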
Funder
Public Health Agency Northern Ireland Health and Social Care Trust
Publisher
Springer Science and Business Media LLC
Subject
Applied Mathematics,Computational Theory and Mathematics,Computer Science Applications,Modeling and Simulation,Information Systems
Cited by
11 articles.