An enhanced random forest approach using CoClust clustering: MIMIC-III and SMS spam collection application

Published:2023-03-30 Issue:1 Volume:10 Page:
ISSN:2196-1115
Container-title:Journal of Big Data
language:en
Short-container-title:J Big Data

Author:

Ilhan Taskin Zeynep,Yildirak Kasirga,Aladag Cagdas Hakan

Abstract

The random forest algorithm can be enhanced to produce better results with a well-designed and organized feature selection phase. The dependency structure between the variables is considered the most important criterion for selecting the variables used in the algorithm during feature selection. As the dependency structure is mostly nonlinear, a tool that accounts for nonlinearity is a more beneficial approach. The Copula-Based Clustering technique (CoClust) clusters variables with copulas according to their nonlinear dependency. We show that it is possible to achieve a remarkable improvement in CPU time and accuracy by adding a CoClust-based feature selection step to the random forest technique. We work with two different large datasets, namely the MIMIC-III Sepsis Dataset and the SMS Spam Collection Dataset. The first dataset is large in terms of rows referring to individual IDs, while the latter is an example of wide data with many variables to be considered. In the proposed approach, random forest is first employed without the CoClust step. Then, random forest is repeated within the clusters obtained with CoClust. The results are compared in terms of CPU time, accuracy and ROC (receiver operating characteristic) curve. CoClust clustering results are also compared with K-means and hierarchical clustering techniques. The Random Forest, Gradient Boosting and Logistic Regression results obtained with these clusters, and the success of RF and CoClust working together, are examined.
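The idea of grouping variables by their dependency structure before modelling can be illustrated with a minimal sketch. This is not CoClust, which fits copula models to form its clusters; as a simplified stand-in, the sketch links variables whose absolute Spearman rank correlation reaches a threshold, since rank correlation also captures monotone nonlinear dependence. The function names, the toy columns, and the 0.8 threshold are all assumptions made for illustration and do not come from the paper.

```python
from itertools import combinations


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def group_features(columns, threshold=0.8):
    """Group feature names whose |Spearman rho| >= threshold (hypothetical rule)."""
    names = list(columns)
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if abs(spearman(columns[a], columns[b])) >= threshold:
            parent[find(b)] = find(a)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())


# Toy data (invented): two base variables and a nonlinear monotone
# transform of each.
x = [1, 2, 3, 4, 5, 6, 7, 8]
noise = [3, 8, 1, 6, 2, 7, 4, 5]
columns = {
    "x": x,
    "x_cubed": [v ** 3 for v in x],
    "noise": noise,
    "noise_exp": [2 ** v for v in noise],
}
groups = group_features(columns)
print(groups)
```

On this toy input each transformed column ends up in the same group as its source variable, even though the relationships are nonlinear. In a workflow like the one the abstract describes, a classifier such as random forest would then be trained within each group and compared against a model trained on all variables.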

Publisher

Springer Science and Business Media LLC

Subject

Information Systems and Management,Computer Networks and Communications,Hardware and Architecture,Information Systems

Link

https://link.springer.com/content/pdf/10.1186/s40537-023-00720-9.pdf

Reference: 61 articles.

1. Darwiche AA. Machine learning methods for septic shock prediction. PhD thesis, Nova Southeastern University, College of Engineering and Computing. 2018. NSUWorks (1051). https://nsuworks.nova.edu/gscis_etd/1051.

2. Lee J. Patient-specific predictive modeling using random forests: an observational study for the critically ill. JMIR Med Informat. 2017. https://doi.org/10.2196/medinform.6690.

3. Levantesi S, Nigri A. A random forest algorithm to improve the Lee-Carter mortality forecasting: impact on q-forward. Soft Comput. 2020;24(12):8553–67. https://doi.org/10.1007/s00500-019-04427-z.

4. McWilliams CJ, et al. Towards a decision support tool for intensive care discharge: machine learning algorithm development using electronic healthcare data from MIMIC-III and Bristol, UK. BMJ Open. 2019. https://doi.org/10.1136/bmjopen-2018-025925.

5. Mistry P, Neagu D, Trundle PR, Vessey JD. Using random forest and decision tree models for a new vehicle prediction approach in computational toxicology. Soft Comput. 2016;20(8):2967–79. https://doi.org/10.1007/s00500-015-1925-9.

Cited by 5 articles.

1. Revolutionizing anemia detection: integrative machine learning models and advanced attention mechanisms;Visual Computing for Industry, Biomedicine, and Art;2024-07-17

2. Dynamic Function Generation for Text Classification;2024 IEEE Congress on Evolutionary Computation (CEC);2024-06-30

3. Integrating Unsupervised and Supervised ML Models for Analysis of Synthetic Data From VAE, GAN, and Clustering of Variables;International Journal of Data Analytics;2024-05-10

4. Investigating Evasive Techniques in SMS Spam Filtering: A Comparative Analysis of Machine Learning Models;IEEE Access;2024

5. Characterization of Mortality Prediction: An Ensemble Learning Analysis Using the MIMIC-III Dataset;Journal of Scientific Reports-A;2023-09-30