Affiliation:
1. School of Information, Yunnan Normal University, Kunming 650500, China
2. Engineering Research Center of Computer Vision and Intelligent Control Technology, Yunnan Provincial Department of Education, Kunming 650500, China
Abstract
With the rapid development of the mobile Internet and the Internet of Things, the volume of generated data keeps growing, and data quality (DQ) has attracted increasing attention. Numerous studies have explored DQ problems across several fields and proposed corresponding data-cleaning strategies. This paper makes three contributions. First, we present a comprehensive and systematic review of DQ-related studies: we classify them into six types (redundant data, missing data, noisy data, erroneous data, conflicting data, and sparse data) and discuss the corresponding data-cleaning strategies for each type. Second, we examine DQ issues and potential solutions for a public bus transportation system, using a real-world traffic big data platform. Finally, we provide two representative examples, noise filtering and missing-value filling, to demonstrate DQ improvement in practice. The experimental results show that: (1) our proposed GPS noise-filtering solution outperforms the baseline and achieves an accuracy of 97%; (2) the multi-source data fusion method achieves a 100% missing repair rate (MRR) for bus arrival and departure records, with an average relative error (ARE) of bus arrival and departure times at stations below 1% and a correlation coefficient (R) close to 1. Our research can offer guidance and lessons for enhancing data governance and quality improvement in bus transportation systems.
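The three evaluation metrics named in the abstract (MRR, ARE, and R) are not defined there. The sketch below shows one common way such metrics are computed. The definitions are assumptions, not necessarily the paper's own: MRR is taken as repaired records divided by missing records, ARE as the mean of absolute error divided by the true value, and R as the Pearson correlation coefficient. The function names and the representation of times as seconds since midnight are hypothetical.

```python
import math


def missing_repair_rate(n_missing, n_repaired):
    """Assumed definition: fraction of missing records that were filled."""
    if n_missing == 0:
        return 1.0  # nothing was missing, so nothing is left unrepaired
    return n_repaired / n_missing


def average_relative_error(actual, estimated):
    """Assumed definition: mean of |estimated - actual| / |actual|."""
    errors = [abs(e - a) / abs(a) for a, e in zip(actual, estimated)]
    return sum(errors) / len(errors)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


# Illustrative values only (not data from the paper):
# arrival times in seconds since midnight, true vs. filled-in.
true_times = [28800, 29400, 30000, 30600]
filled_times = [28810, 29390, 30015, 30600]

mrr = missing_repair_rate(n_missing=4, n_repaired=4)
are = average_relative_error(true_times, filled_times)
r = pearson_r(true_times, filled_times)
```

Under these assumed definitions, an MRR of 100% means every missing arrival or departure record received a filled-in value, while ARE and R indicate how close those filled-in values are to the ground truth.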
Funder
National Natural Science Foundation of China
Subject
Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science