Affiliation:
1. IDS Team, Abdelmalek Essaadi University, Tangier, Morocco
Abstract
Data cleaning, also called data cleansing, is a pivotal phase of data processing that follows data collection. Its primary objective is to identify and eliminate incomplete data, duplicates, outdated information, anomalies, missing values, and errors. Because data quality strongly affects the effectiveness of machine learning (ML) models, data scientists dedicate substantial effort to data cleaning before model training. This study highlights critical facets of data cleaning and the use of outlier detection algorithms. We also benchmark prominent outlier detection algorithms to identify an efficient algorithm with consistent performance. Finally, we introduce a new algorithm that fuses Isolation Forest with clustering techniques, aiming to improve outlier detection by leveraging the strengths of both methods. The work underscores the importance of data cleaning and its close relationship with ML models, and its exploration of outlier detection methods serves the broader goal of refining data processing and analysis. Through theoretical insights, algorithmic benchmarking, and the proposed algorithm, this study contributes to the advancement of data cleaning and outlier detection techniques for contemporary data-driven environments.
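The abstract does not specify how Isolation Forest and clustering are fused, so the following is only a minimal sketch of one plausible combination, not the authors' algorithm. It assumes scikit-learn and NumPy, uses k-means as the clustering step, and averages two normalized signals with equal weight: the Isolation Forest anomaly score and each point's distance to its assigned cluster centroid. The function names `fused_outlier_scores` and `flag_outliers`, the 50/50 weighting, and the quantile threshold are all illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest


def _minmax(v):
    """Rescale a 1-D array to [0, 1]; a constant array maps to zeros."""
    span = v.max() - v.min()
    return (v - v.min()) / span if span > 0 else np.zeros_like(v)


def fused_outlier_scores(X, n_clusters=2, random_state=0):
    """Hypothetical fusion: average of two normalized anomaly signals."""
    X = np.asarray(X, dtype=float)

    # Signal 1: Isolation Forest. score_samples is lower for abnormal
    # points, so negate it to make higher values mean "more anomalous".
    forest = IsolationForest(n_estimators=100, random_state=random_state)
    iso_score = -forest.fit(X).score_samples(X)

    # Signal 2: clustering. Distance of each point to the centroid of
    # the k-means cluster it was assigned to.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    km.fit(X)
    centroid_dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

    # Equal weighting is an assumption made for illustration only.
    return 0.5 * _minmax(iso_score) + 0.5 * _minmax(centroid_dist)


def flag_outliers(X, contamination=0.05, **kwargs):
    """Flag points whose fused score exceeds the (1 - contamination) quantile."""
    scores = fused_outlier_scores(X, **kwargs)
    return scores > np.quantile(scores, 1.0 - contamination)


# Usage on synthetic data: two tight blobs plus three distant points.
rng = np.random.default_rng(0)
inliers = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.5, size=(100, 2)),
])
injected = np.array([[10.0, -10.0], [-10.0, 10.0], [15.0, 15.0]])
data = np.vstack([inliers, injected])
flags = flag_outliers(data, contamination=0.05, n_clusters=2)
print("flagged indices:", np.flatnonzero(flags))
```

Combining the two signals is meant to let a point that looks unremarkable to one detector still be caught by the other. In practice the number of clusters, the weighting, and the contamination rate would need to be tuned against labeled or benchmark data.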