Dealing with the Missing, Imbalanced and Sparse Features Problems in Emergency Data Using Random Forest, K-means and PCA Respectively (Preprint)

Author:

Chen Xiaojie, Chen Han, Nan Shan, Kong Xiangtian, Duan Huilong, Zhu Haiyan

Abstract

BACKGROUND

In emergency departments (EDs), timely rescue is critical because patients' conditions often deteriorate rapidly. Early diagnosis can increase patients' chances of survival, and predictive machine learning models built on electronic medical record (EMR) data can support it. However, ED data are usually imbalanced and contain missing values and sparse features. These data quality issues make it challenging to build early identification models for diseases in the ED.

OBJECTIVE

The objective of this study is to propose a systematic approach to dealing with missing values, class imbalance, and sparse features in ED data.

METHODS

We used a random forest algorithm to impute missing values and the K-means algorithm to under-sample the data. To handle sparse features, we used principal component analysis (PCA) to reduce dimensionality. Imputation performance was evaluated with the coefficient of determination (R²) for continuous variables and the Kappa coefficient for discrete variables. Model performance was estimated with the area under the receiver operating characteristic curve (AUC) and the area under the precision-recall curve (AUPRC). To further evaluate the proposed approach, we carried out a case study using an ED dataset extracted from Hainan Hospital of Chinese PLA General Hospital. A logistic regression model for predicting worsening of patient condition was built from the data processed by the proposed approach.
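The two imputation-quality metrics named above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it assumes that known values are held out, imputed, and then compared with the originals, and that the Kappa coefficient refers to Cohen's kappa. The function names `r_squared` and `cohen_kappa` are hypothetical.

```python
from collections import Counter


def r_squared(observed, imputed):
    """Coefficient of determination between held-out values and their imputations."""
    if len(observed) != len(imputed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: error of the imputed values.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, imputed))
    # Total sum of squares: spread of the observed values around their mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the observed values are constant")
    return 1.0 - ss_res / ss_tot


def cohen_kappa(observed, imputed):
    """Cohen's kappa (assumed variant) between held-out categories and imputations."""
    if len(observed) != len(imputed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(observed)
    # Observed agreement: fraction of positions where the categories match.
    p_observed = sum(o == p for o, p in zip(observed, imputed)) / n
    obs_counts = Counter(observed)
    imp_counts = Counter(imputed)
    # Chance agreement: product of marginal frequencies, summed over categories.
    p_expected = sum(obs_counts[c] * imp_counts[c] for c in obs_counts) / (n * n)
    if p_expected == 1.0:
        # Both sequences consist of the same single category.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    # Toy values for illustration only; they are not taken from the study.
    print(r_squared([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]))
    print(cohen_kappa([0, 0, 1, 1], [0, 0, 1, 0]))
```

Note that R² computed this way can be negative when the imputed values fit worse than simply predicting the mean of the held-out values, and kappa is 0 when agreement is no better than chance.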

RESULTS

A total of 1085 patients with a rescue record and 17959 patients without a rescue record were included, a highly imbalanced distribution. In total, 275, 402, and 891 variables were extracted from laboratory tests, medications, and diagnoses, respectively. After data preprocessing, the median R² of random forest imputation for continuous variables was 0.623 (IQR: 0.647), and the median Kappa coefficient for discrete variable imputation was 0.444 (IQR: 0.285). The logistic regression model built on the initial diagnostic data performed poorly and showed variable separation, reflected in abnormally high odds ratios (ORs) for two variables, cardiac arrest and respiratory arrest (27857.4 and 9341.6, respectively), and in abnormal confidence intervals. Using the processed data, the model reached a recall of 0.77, an F1 score of 0.74, and an AUC of 0.64.
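To put the reported figures in context, the sketch below applies two standard identities: an odds ratio in logistic regression is the exponential of its coefficient, and F1 is the harmonic mean of precision and recall. The helper names are hypothetical, and the derived values are back-of-the-envelope quantities computed from the rounded numbers above, not results reported by the authors.

```python
import math


def log_odds_coefficient(odds_ratio):
    """Logistic regression coefficient implied by an odds ratio (OR = exp(beta))."""
    if odds_ratio <= 0:
        raise ValueError("an odds ratio must be positive")
    return math.log(odds_ratio)


def implied_precision(recall, f1):
    """Precision implied by recall and F1, solving F1 = 2PR / (P + R) for P."""
    denominator = 2.0 * recall - f1
    if denominator <= 0:
        raise ValueError("no valid precision exists for this recall and F1")
    return f1 * recall / denominator


if __name__ == "__main__":
    # Coefficients implied by the two abnormally high ORs (about 10.23 and 9.14).
    print(log_odds_coefficient(27857.4))
    print(log_odds_coefficient(9341.6))
    # Precision implied by recall 0.77 and F1 0.74 (about 0.71, subject to rounding).
    print(implied_precision(0.77, 0.74))
```

Coefficients of this size are a common symptom of separation: when a variable almost perfectly predicts the outcome, the maximum likelihood estimate drifts toward infinity and the confidence interval becomes unreliable.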

CONCLUSIONS

We proposed a machine learning approach to address data quality issues in emergency data, namely missing data, class imbalance, and sparse features, and thereby improve data usability. A preliminary case study indicates that the data produced by the proposed method can be used to build prediction models for emergency patients.

Publisher

JMIR Publications Inc.
