Unsupervised Feature Selection to Identify Important ICD-10 and ATC Codes for Machine Learning: A Case Study on a Coronary Artery Disease Patient Cohort (Preprint)

Authors:

Ghasemi Peyman, Lee Joon

Abstract

BACKGROUND

The application of machine learning in healthcare often requires hierarchical coding systems such as the International Classification of Diseases (ICD) and the Anatomical Therapeutic Chemical (ATC) classification. These systems classify diseases and medications, respectively, and their large code vocabularies produce very high-dimensional feature spaces. Unsupervised feature selection tackles the "curse of dimensionality" and can improve the accuracy and performance of supervised learning models by removing irrelevant or redundant features and reducing overfitting. Unsupervised feature selection techniques, which include filter, wrapper, and embedded methods, aim to select the features that carry the most intrinsic information. However, they are challenged by the sheer volume of ICD/ATC codes and by the hierarchical structure of these systems.

OBJECTIVE

The objective of this study was to compare several unsupervised feature selection methods on ICD and ATC code databases of coronary artery disease patients in terms of performance and complexity, and to select the set of features that best represents these patients.

METHODS

We compared several unsupervised feature selection methods on two ICD code databases and one ATC code database of 51,506 coronary artery disease patients in Alberta, Canada. Specifically, we employed Laplacian Score, Unsupervised Feature Selection for Multi-Cluster Data, Autoencoder Inspired Unsupervised Feature Selection, Principal Feature Analysis, and Concrete Autoencoders with and without ICD/ATC tree weight adjustment to select the 100 best features from over 9,000 ICD and 2,000 ATC codes. We assessed the selected features based on their ability to reconstruct the initial feature space and to predict 90-day mortality following discharge. We also compared the complexity of the selected features, measured by the mean code level in the ICD/ATC tree, and their interpretability in the mortality prediction task using Shapley analysis.
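As an illustration of the simplest of the compared methods, the following is a minimal NumPy sketch of the Laplacian Score filter, assuming a k-nearest-neighbour graph with heat-kernel weights. The function name `laplacian_scores` and the parameters `n_neighbors` and `t` are illustrative assumptions; the abstract does not describe the authors' implementation or hyperparameters. Features that vary smoothly over the neighbourhood graph receive lower (better) scores.

```python
import numpy as np


def laplacian_scores(X, n_neighbors=5, t=1.0):
    """Return one Laplacian Score per column of X (lower means better)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Pairwise squared Euclidean distances between samples (rows).
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)

    # k-nearest-neighbour graph with self-matches excluded, then symmetrised.
    d2_noself = d2.copy()
    np.fill_diagonal(d2_noself, np.inf)
    nn = np.argsort(d2_noself, axis=1)[:, :n_neighbors]
    mask = np.zeros((n, n), dtype=bool)
    mask[np.arange(n)[:, None], nn] = True
    mask = mask | mask.T

    # Heat-kernel edge weights S, degree vector, and Laplacian L = D - S.
    S = np.where(mask, np.exp(-d2 / t), 0.0)
    deg = S.sum(axis=1)
    L = np.diag(deg) - S

    # Remove the degree-weighted mean from every feature.
    Xc = X - (deg @ X) / deg.sum()

    num = np.sum(Xc * (L @ Xc), axis=0)             # f^T L f per feature
    den = np.sum(Xc * (deg[:, None] * Xc), axis=0)  # f^T D f per feature

    # Constant features have zero weighted variance; give them the worst score.
    scores = np.full(X.shape[1], np.inf)
    np.divide(num, den, out=scores, where=den > 1e-12)
    return scores
```

Under these assumptions, the indices of the 100 best-scoring codes would be `np.argsort(laplacian_scores(X))[:100]`, where `X` is a patients-by-codes matrix. The dense distance matrix makes this sketch practical only for small samples; a cohort of the size reported here would need a sparse neighbour graph.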

RESULTS

In both feature space reconstruction and mortality prediction, the Concrete Autoencoder-based methods outperformed the other techniques. In particular, a weight-adjusted Concrete Autoencoder variant demonstrated improved reconstruction accuracy and a statistically significant improvement in predictive performance, confirmed by DeLong's and McNemar's tests (P<.05). Concrete Autoencoders preferred more general codes, and they consistently reconstructed all features accurately. Additionally, features selected by weight-adjusted Concrete Autoencoders yielded higher Shapley values in mortality prediction than those selected by most alternatives.
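The notion of "more general codes" can be made concrete with a small sketch of a mean code level computation for ATC codes. It assumes the standard five-level ATC structure, in which the level is determined by code length (for example, `A`, `A10`, `A10B`, `A10BA`, `A10BA02`). The abstract does not state how the authors defined code level, and the helper names below are hypothetical.

```python
# Standard ATC hierarchy: the code length determines the level.
ATC_LEVEL_BY_LENGTH = {1: 1, 3: 2, 4: 3, 5: 4, 7: 5}


def atc_level(code):
    """Return the ATC hierarchy level (1 = most general, 5 = substance)."""
    code = code.strip().upper()
    if len(code) not in ATC_LEVEL_BY_LENGTH:
        raise ValueError(f"unexpected ATC code length: {code!r}")
    return ATC_LEVEL_BY_LENGTH[len(code)]


def mean_atc_level(codes):
    """Average hierarchy level of a set of selected ATC codes."""
    levels = [atc_level(c) for c in codes]
    if not levels:
        raise ValueError("no codes given")
    return sum(levels) / len(levels)
```

A lower mean level indicates that a selection method favours broader groups over individual substances.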

CONCLUSIONS

This study compared five unsupervised feature selection methods on ICD/ATC code datasets. Our findings underscore the superiority of the Concrete Autoencoder method in selecting salient features that represent the entire dataset, which makes it a useful preprocessing step for subsequent machine learning research. We also present a novel weight adjustment approach for Concrete Autoencoders, specifically tailored to ICD/ATC code datasets, to enhance the generalizability and interpretability of the selected features.
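For readers unfamiliar with Concrete Autoencoders, the sketch below shows the core of their selector layer: a Gumbel-softmax (Concrete) relaxation that draws one near-one-hot weight vector per selected feature. It is a generic NumPy illustration, not the authors' model; the function name `concrete_selector_weights` and its arguments are assumptions, and the ICD/ATC tree weight adjustment proposed in this study is not reproduced because the abstract does not specify it.

```python
import numpy as np


def concrete_selector_weights(logits, temperature, rng):
    """Sample relaxed one-hot selection weights.

    logits: array of shape (k, d), one row of learnable scores per
            selected feature over d input features.
    Returns an array of shape (k, d) whose rows sum to 1.
    """
    logits = np.asarray(logits, dtype=float)
    gumbel = rng.gumbel(size=logits.shape)     # Gumbel(0, 1) noise
    z = (logits + gumbel) / temperature
    z = z - z.max(axis=1, keepdims=True)       # stabilise the softmax
    w = np.exp(z)
    return w / w.sum(axis=1, keepdims=True)
```

During training, `X @ w.T` gives a soft selection of `k` features from a patients-by-codes matrix `X`. As the temperature is annealed toward zero, each row approaches a one-hot vector, and at inference the selected feature for each row is simply `logits.argmax(axis=1)`.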

Publisher

JMIR Publications Inc.
