Dealing With Missing, Imbalanced, and Sparse Features During the Development of a Prediction Model for Sudden Death Using Emergency Medicine Data: Machine Learning Approach

Published:2023-01-20 Issue: Volume:11 Page:e38590
ISSN:2291-9694
Container-title:JMIR Medical Informatics
language:en
Short-container-title:JMIR Med Inform

Author:

Chen Xiaojie, Chen Han, Nan Shan, Kong Xiangtian, Duan Huilong, Zhu Haiyan

Abstract

Background: In emergency departments (EDs), early diagnosis and timely rescue, supported by prediction models built on ED data, can increase patients’ chances of survival. Unfortunately, ED data usually contain missing, imbalanced, and sparse features, which makes it challenging to build early identification models for diseases.

Objective: This study aims to propose a systematic approach to deal with the problems of missing, imbalanced, and sparse features when developing sudden-death prediction models using emergency medicine (or ED) data.

Methods: We proposed a 3-step approach to deal with data quality issues: a random forest (RF) for missing values, k-means for imbalanced data, and principal component analysis (PCA) for sparse features. The coefficient of determination (R²) and the κ coefficient were used to evaluate the RF interpolation of continuous and discrete variables, respectively. The area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) were used to estimate the model’s performance. To further evaluate the proposed approach, we carried out a case study using an ED data set obtained from the Hainan Hospital of Chinese PLA General Hospital. A logistic regression (LR) prediction model for patient condition worsening was built.

Results: A total of 1085 patients with rescue records and 17,959 patients without rescue records were selected, so the two classes were significantly imbalanced. We extracted 275, 402, and 891 variables from laboratory tests, medications, and diagnosis, respectively. After data preprocessing, the median R² of the RF interpolation for continuous variables was 0.623 (IQR 0.647), and the median κ coefficient of the interpolation for discrete variables was 0.444 (IQR 0.285). The LR model constructed using the initial diagnostic data showed poor performance and variable separation, reflected in abnormally high odds ratio (OR) values for 2 variables, cardiac arrest and respiratory arrest (201568034532 and 1211118945, respectively), and an abnormal 95% CI. Using processed data, the recall of the model reached 0.746, the F1-score was 0.73, and the AUROC was 0.708.

Conclusions: The proposed systematic approach is valid for building a prediction model for emergency patients.
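The two interpolation metrics named in the abstract can be illustrated with a minimal, standard-library sketch. It assumes the common evaluation setup in which known values are hidden, imputed, and then compared with the originals; the abstract does not describe the authors' exact procedure. The function names and the toy values are hypothetical and are not taken from the study.

```python
from collections import Counter


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Undefined (division by zero) if all true values are identical.
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def cohen_kappa(labels_true, labels_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e).

    p_o is the observed agreement; p_e is the agreement expected by
    chance from the marginal label frequencies. Undefined if p_e == 1.
    """
    n = len(labels_true)
    p_observed = sum(t == p for t, p in zip(labels_true, labels_pred)) / n
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)
    p_expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy values for illustration only (hypothetical, not from the study):
# a continuous laboratory value and a binary diagnosis flag, each with
# the held-out true values and the values an imputer filled in.
true_lab = [4.0, 5.0, 6.0, 7.0]
imputed_lab = [4.5, 5.0, 5.5, 7.0]
true_flag = [1, 1, 0, 0, 1, 0]
imputed_flag = [1, 0, 0, 0, 1, 1]

r2 = r_squared(true_lab, imputed_lab)
kappa = cohen_kappa(true_flag, imputed_flag)
print(f"R^2 = {r2:.3f}, kappa = {kappa:.3f}")  # R^2 = 0.900, kappa = 0.333
```

On this reading, a median of 0.623 for continuous variables and 0.444 for discrete variables would summarize per-variable scores of this kind across all imputed variables.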

Publisher

JMIR Publications Inc.

Subject

Health Information Management,Health Informatics

Cited by 6 articles.

1. Interpretable machine learning models for predicting clinical pregnancies associated with surgical sperm retrieval from testes of different etiologies: a retrospective study;BMC Urology;2024-07-29

2. A methodological showcase: utilizing minimal clinical parameters for early-stage mortality risk assessment in COVID-19-positive patients;PeerJ Computer Science;2024-04-30

3. Application effect study of a combination of TeamSTEPPS with modularization teaching in the context of clinical instruction in trauma care;Scientific Reports;2024-02-27

4. A federated learning system with data fusion for healthcare using multi-party computation and additive secret sharing;Computer Communications;2024-02

5. A machine learning-based prediction model for postoperative delirium in cardiac valve surgery using electronic health records;BMC Cardiovascular Disorders;2024-01-18