Author:
Elmobark Nagwa, El-ghareeb Haitham, Elhishi Sara
Abstract
With the rapid growth of the Internet of Things (IoT) and the emergence of big data, handling massive amounts of data has become a major challenge. Traditional approaches send raw data to cloud data centers, where data warehouse tools clean, process, and interpret it. This study introduces BlueEdge, a fog-edge mobile application that shifts the cleaning and preprocessing tasks from the cloud to the edge. We compare BlueEdge with four popular data cleaning tools (WinPure, DoubleTake, WizSame, and DQGlobal) that operate within data warehouse architectures, such as Hadoop servers. The comparison criteria are time consumption, resource utilization (memory and CPU), and tool performance. BlueEdge uses Natural Language Processing (NLP) techniques, via the Natural Language Toolkit (NLTK) and Python packages, and connects to a real-time database. Our results show that BlueEdge performs the same cleaning services in real time on the fog edge (a mobile device). It performs well on duplicate-elimination services, handling variant spellings and pronunciations, misspellings, name abbreviations, honorific prefixes, common nicknames, and split names. BlueEdge achieves accuracy ranging from 90% to 100%, surpassing the data warehouse applications running on cloud servers. BlueEdge also reduces memory consumption to 5,000 bytes per edge on mobile devices, whereas the data warehouse tools use 10,000 to 60,000 bytes on Hadoop servers. It likewise reduces data cleaning time to 1 second per edge, compared with the typical 4 to 30 seconds for data warehouses. Notably, BlueEdge can be executed by users on mobile devices without permission, and it is available free of charge.
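To make the duplicate-elimination categories above concrete, the following is a minimal illustrative sketch, not the BlueEdge implementation. It assumes a plain list of personal names and uses Python's standard-library `difflib` in place of NLTK so that it runs without extra dependencies. The lookup tables `HONORIFICS` and `NICKNAMES`, the function names, and the `0.85` similarity threshold are all hypothetical choices made for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical lookup tables, for illustration only (not from the paper).
HONORIFICS = {"mr", "mrs", "ms", "dr", "prof"}
NICKNAMES = {"bob": "robert", "bill": "william", "liz": "elizabeth"}


def normalize(name: str) -> str:
    """Lowercase, drop punctuation and honorific prefixes, expand nicknames."""
    tokens = re.findall(r"[a-z]+", name.lower())
    tokens = [t for t in tokens if t not in HONORIFICS]
    tokens = [NICKNAMES.get(t, t) for t in tokens]
    return " ".join(tokens)


def is_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two names as duplicates if their normalized forms are similar.

    Spaces are removed before comparison so that split names
    ("Mary Ann" vs "Maryann") compare as equal; the similarity ratio
    tolerates small misspellings. The threshold is an assumed value.
    """
    na, nb = normalize(a), normalize(b)
    if na == nb:
        return True
    ca, cb = na.replace(" ", ""), nb.replace(" ", "")
    return SequenceMatcher(None, ca, cb).ratio() >= threshold


def deduplicate(names, threshold: float = 0.85):
    """Keep the first occurrence of each name, dropping later near-duplicates."""
    kept = []
    for name in names:
        if not any(is_duplicate(name, k, threshold) for k in kept):
            kept.append(name)
    return kept


if __name__ == "__main__":
    records = ["Dr. Bob Smith", "Robert Smith", "Alice Johnson"]
    print(deduplicate(records))
```

This sketch covers honorific prefixes, common nicknames, split names, and small misspellings. It does not handle name abbreviations (for example, an initial standing in for a first name) or pronunciation-based matching, which would need additional rules or a phonetic encoding.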
Publisher
Research Square Platform LLC