Affiliation:
1. School of Electronic Engineering and Computer Science, Queen Mary University of London, London, UK
Abstract
As a distributed system, Hadoop relies heavily on the network to complete data-processing jobs. While the traffic generated by Hadoop jobs is critical for job execution performance, the actual behaviour of Hadoop network traffic is still poorly understood. This lack of understanding greatly complicates research relying on Hadoop workloads. In this article, we explore Hadoop traffic through empirical traces. We analyse the traffic generated by multiple types of MapReduce jobs, with varying input sizes and cluster configuration parameters. We present Keddah, a toolchain for capturing, modelling, and reproducing Hadoop traffic, for use with network simulators to better capture the behaviour of Hadoop. By imitating the Hadoop traffic generation process and considering the YARN resource allocation, Keddah can be used to create Hadoop traffic workloads, enabling reproducible Hadoop research in more realistic scenarios.
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Science Applications, Modeling and Simulation
Cited by
1 article.