Missing data imputation using correlation coefficient and min-max normalization weighting

Published:2024-07-21 Page:1-15
ISSN:1088-467X
Container-title:Intelligent Data Analysis
Short-container-title:IDA

Author:

Shantal Mohammed¹, Othman Zalinda², Abu Bakar Azuraliza²

Affiliation:

1. Computer Science Department, Sebha University, Sebha, Libya

2. The Center for Artificial Intelligence Technology, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Selangor, Malaysia

Abstract

Missing data is one of the challenges researchers encounter when trying to extract information from data. The first step in addressing it is to prepare the data for processing. Much effort has gone into this area. Removing instances with missing values is a popular approach, but it has drawbacks, including bias, which negatively affects the results. How missing values should be handled depends on several factors, including data types, missing rates, and missing mechanisms. These factors cover missing data patterns as well as the mechanisms missing at random (MAR), missing completely at random (MCAR), and missing not at random (MNAR). Other approaches rely on imputation techniques, which fall into categories such as statistical and machine learning methods. One strategy for improving a model’s output is to weight the feature values so that classification or regression approaches perform better. This research developed a new imputation technique called correlation coefficient min-max weighted imputation (CCMMWI). It combines the correlation coefficient and min-max normalization techniques to balance the feature values. The proposed technique seeks to increase the contribution of features by considering how those features relate to the desired functionality. To assess the findings, we compared CCMMWI with several established techniques: the statistical techniques mean and expectation-maximization (EM) imputation, and the machine learning imputation techniques k-NNI and MICE. The evaluation also included the imputation techniques CBRL, CBRC, and ExtraImpute. We used datasets of various sizes, missing rates, and random patterns. Finally, we compared the imputed datasets with the original data using the root mean squared error (RMSE), mean absolute error (MAE), and R². According to the findings, the proposed CCMMWI performs better than most other solutions in practically all missing-rate scenarios.
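The abstract names its building blocks (min-max normalization, the correlation coefficient) and its evaluation metrics (RMSE, MAE, R²) but does not say how CCMMWI combines them. The following is a minimal sketch of those standard definitions only, in plain Python. It is not the authors' algorithm; the function names and the toy values are assumptions chosen for illustration.

```python
import math

def min_max(values):
    """Scale a list of numbers to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    # Return 0.0 when either column has no variance (correlation undefined).
    return cov / (sx * sy) if sx and sy else 0.0

def rmse(true, pred):
    """Root mean squared error between original and imputed values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true))

def mae(true, pred):
    """Mean absolute error between original and imputed values."""
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)

def r2(true, pred):
    """Coefficient of determination; assumes `true` is not constant."""
    mean_t = sum(true) / len(true)
    ss_res = sum((t - p) ** 2 for t, p in zip(true, pred))
    ss_tot = sum((t - mean_t) ** 2 for t in true)
    return 1.0 - ss_res / ss_tot

# Hypothetical toy values: the originals that were masked, and what
# some imputation method filled in for them.
original = [2.0, 4.0, 6.0, 8.0]
imputed = [2.0, 5.0, 6.0, 7.0]

scores = {
    "RMSE": rmse(original, imputed),
    "MAE": mae(original, imputed),
    "R2": r2(original, imputed),
}
```

In a comparison like the one the abstract describes, these metrics would be computed over the cells that were artificially removed, once per imputation method and missing rate, with lower RMSE and MAE and higher R² indicating imputed values closer to the originals.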

Publisher

IOS Press

References: 54 articles.

1. Khan, Big data analytics for electricity theft detection in smart grids, in 2021 IEEE Madrid PowerTech (2021).

2. Santos, Generating synthetic missing data: A review by missing mechanism, IEEE Access (2019).

3. S.F. Wu, C.Y. Chang and S.J. Lee, Time series forecasting with missing values, in 2015 1st International Conference on Industrial Networks and Intelligent Systems (INISCom) (2015), 151–156.

4. I. Chlioui, I. Abnane and A. Idri, Comparing statistical and machine learning imputation techniques in breast cancer classification, in Computational Science and Its Applications–ICCSA 2020: 20th International Conference, Cagliari, Italy, July 1-4, 2020, Proceedings, Part IV 20 (2020), pp. 61–76.

5. C. Yan, J. Yuan, Z. Ye and Z. Yang, A Discrete Missing Data Imputation Method Based on Improved Multi-layer Perceptron, in 2021 11th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS) 1 (2021), pp. 480–484.