Smart Data Prefetching Using KNN to Improve Hadoop Performance

Published: 2023-08-04

Author:

Ghazali Rana¹, Down Douglas G.²

Affiliation:

1. Islamic Azad University

2. McMaster University

Abstract

Hadoop is an open-source framework that enables the parallel processing of large data sets across a cluster of machines. It faces several challenges that can lead to poor performance, such as I/O operations, network data transmission, and high data access time. In recent years, researchers have explored prefetching techniques to reduce the data access time as a potential solution to these problems. Nevertheless, several issues must be considered to optimize the prefetching mechanism. These include launching the prefetch at an appropriate time to avoid conflicts with other operations and minimize waiting time, determining the amount of prefetched data to avoid overload and underload, and placing the prefetched data in a location that can be accessed efficiently when required. In this paper, we propose a smart prefetch mechanism that consists of three phases designed to address these issues. First, we enhance the task progress rate to calculate the optimal time for triggering prefetch operations. Next, we utilize K-Nearest Neighbor (KNN) clustering to identify which data blocks should be prefetched in each round, employing the data locality feature to determine the placement of prefetched data. Our experimental results demonstrate that our proposed smart prefetch mechanism improves job execution time by an average of 28.33% by increasing the rate of local tasks.
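
As a rough illustration of the block-selection idea in the abstract, the following minimal Python sketch uses a plain K-Nearest Neighbor majority vote to decide which candidate blocks to prefetch. It is not the authors' implementation: the feature vector (normalized access frequency and recency), the labels, the block names, and the helper functions knn_predict and select_blocks_to_prefetch are all hypothetical and chosen only for this example.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Return the majority label among the k training points nearest to query.

    train is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def select_blocks_to_prefetch(history, candidates, k=3):
    """Keep the candidate blocks whose predicted label is 1 (prefetch)."""
    return [
        block_id
        for block_id, features in candidates.items()
        if knn_predict(history, features, k) == 1
    ]

# Hypothetical history: (access frequency, recency) -> 1 if the block was
# needed in the following round, 0 otherwise. Values are assumed, not measured.
history = [
    ((0.9, 0.8), 1),
    ((0.8, 0.9), 1),
    ((0.7, 0.7), 1),
    ((0.1, 0.2), 0),
    ((0.2, 0.1), 0),
    ((0.3, 0.2), 0),
]

# Hypothetical candidate blocks for the next prefetch round.
candidates = {
    "blk_A": (0.85, 0.75),
    "blk_B": (0.15, 0.25),
    "blk_C": (0.60, 0.80),
}

selected = select_blocks_to_prefetch(history, candidates, k=3)
print(selected)
```

An odd k avoids tied votes with two labels. A real system would also need the trigger-time and placement phases the abstract describes, which this sketch leaves out.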

Publisher

Research Square Platform LLC

References (16 articles)

1. “Apache Hadoop” http://Hadoop.apache.org/

2. Dean, J., Ghemawat, S.: MapReduce: Simplified Data Processing on Large Clusters, Communications of the ACM, vol. 51, no. 1 (2008)

3. Shvachko, K., Kuang, H., Radia, S., Chansler, R.: The Hadoop Distributed File System, IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST) (2010)

4. Li, H., Jiang, H., Wang, D., Han, B.: An improved KNN algorithm for text classification, Eighth International Conference on Instrumentation and Measurement, Computer, Communication and Control (IMCCC), pp. 1081–1085 (2018)

5. Luo, Y., Shi, J., Zhou, S.: JeCache: Just-Enough Data Caching with Just-in-Time Prefetching for Big Data Applications, Proceedings - International Conference on Distributed Computing Systems, pp. 2405–2410 (2017). doi: 10.1109/ICDCS.2017.268