Abstract
The exponential growth of data, coupled with the widespread application of artificial intelligence (AI), presents organizations with challenges in upholding data accuracy, especially within data engineering functions. While the Extraction, Transformation, and Loading (ETL) process addresses error-free data ingestion, validating the content within data streams remains a challenge. Prompt detection and remediation of data issues are crucial, especially in automated analytical environments driven by AI. To address these issues, this study focuses on detecting drifts in data distributions and divergence within data fields processed from different sample populations. Using a hypothetical banking scenario, we illustrate the impact of data drift on automated decision-making processes. We propose a scalable method based on a Kullback-Leibler (KL) divergence measure, specifically the Population Stability Index (PSI), to detect and quantify data drift. Through comprehensive simulations, we demonstrate the effectiveness of PSI in identifying and mitigating data drift issues. This study contributes to enhancing data engineering functions in organizations by offering a scalable solution for early drift detection in data ingestion pipelines. We discuss related research, identify gaps, and present the methodology and experimental results, underscoring the importance of robust data governance practices in mitigating risks associated with data drift and improving data observability.
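As an illustration of the measure named in the abstract, the following is a minimal sketch of a PSI computation, not the authors' implementation. It assumes the commonly used form PSI = sum over bins of (a_i - e_i) * ln(a_i / e_i), where e_i and a_i are the proportions of the baseline and the incoming sample in bin i; this form equals the sum of the two directed KL divergences. The function name, the quantile binning, the `eps` floor, and the simulated balances are all assumptions made for the example.

```python
import numpy as np


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """Sketch of PSI between a baseline sample and an incoming sample.

    Hypothetical helper: bins are taken from baseline quantiles, and empty
    bins are floored at `eps` so the logarithm stays finite.
    """
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)

    # Bin edges from baseline quantiles; duplicates are dropped for
    # heavily tied data.
    edges = np.unique(np.quantile(expected, np.linspace(0.0, 1.0, n_bins + 1)))

    # Use only interior edges so values outside the baseline range fall
    # into the first or last bin instead of being discarded.
    interior = edges[1:-1]
    n = len(interior) + 1
    e_counts = np.bincount(np.searchsorted(interior, expected, side="right"), minlength=n)
    a_counts = np.bincount(np.searchsorted(interior, actual, side="right"), minlength=n)

    # Convert counts to proportions and floor them to avoid log(0).
    e = np.clip(e_counts / e_counts.sum(), eps, None)
    a = np.clip(a_counts / a_counts.sum(), eps, None)

    return float(np.sum((a - e) * np.log(a / e)))


# Simulated account balances (illustrative data, not from the study).
rng = np.random.default_rng(42)
baseline = rng.normal(loc=5000.0, scale=1000.0, size=10_000)
same_population = rng.normal(loc=5000.0, scale=1000.0, size=10_000)
drifted = rng.normal(loc=6000.0, scale=1000.0, size=10_000)

print("PSI, same population:", population_stability_index(baseline, same_population))
print("PSI, drifted batch:  ", population_stability_index(baseline, drifted))
```

A widely quoted rule of thumb, which is an industry convention rather than a result of this study, reads PSI below 0.1 as stable, 0.1 to 0.25 as moderate shift, and above 0.25 as significant drift. In an ingestion pipeline, a check like this could run per field on each incoming batch against a stored baseline.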
Publisher
Springer Science and Business Media LLC
References (16 articles)
1. Abedjan Z, Chu X, Deng D, Fernandez R, Ilyas I, Ouzzani M, Papotti P, Stonebraker M, Tang N (2016) Detecting data errors: Where are we and what needs to be done? Proceedings of the VLDB Endowment 9(12):993–1004. https://doi.org/10.14778/2994509.2994518
2. Basterrech S, Wozniak M (2022) Tracking changes using Kullback-Leibler divergence for the continual learning, pp 3279–3285. https://doi.org/10.1109/SMC53654.2022.9945547
3. Gama J, Žliobaitė I, Bifet A, Pechenizkiy M, Bouchachia H (2014) A survey on concept drift adaptation. ACM Computing Surveys (CSUR) 46. https://doi.org/10.1145/2523813
4. Ghomeshi H, Gaber MM, Kovalchuk Y (2019) EACD: evolutionary adaptation to concept drifts in data streams. Data Mining and Knowledge Discovery 33(3):663–694
5. Gudivada V, Apon A, Ding J (2017) Data quality considerations for big data and machine learning: going beyond data cleaning and transformations. Int J Adv Softw 10(1):1–20