Spark Performance Optimization Analysis With Multi-Layer Parameter Using Shuffling and Scheduling With Data Serialization in Different Data Caching Options

Published:2021-01 Issue:1 Volume:1 Page:1-17
ISSN:2767-3804
Container-title:Journal of Technological Advancements
language:ng

Author:

Deleli Mesay¹, Adinew Deleli Mesay¹, Alemu Ayall Tewodros¹

Affiliation:

1. UESTC, China & Dilla University, Ethiopia

Abstract

As social networking services and e-commerce grow rapidly, the number of online users is also growing, and those users contribute huge volumes of content to the digital world. In such a dynamic environment, meeting the demand for computing is very challenging, especially with existing computing models. Spark was introduced to alleviate these problems through in-memory computing for big data analytics, and it exposes many configuration parameters for tuning its performance. It still has performance bottlenecks, however, which calls for investigating performance improvement mechanisms that focus on combinations of scheduling and shuffle manager settings with data serialization and intermediate data caching options. A standalone cluster was selected as the experimental setup, with applications submitted through the submit command line. Three Spark applications, WordCount, TeraSort, and PageRank, were selected and developed for the experiments. As a result, performance improvements of 2.45% and 8.01% were achieved with the OFF_HEAP and MEMORY_ONLY_SER data caching options, respectively.

Publisher

IGI Global

References (27 articles)

1. Spark Performance Optimization Analysis in Memory Tuning On GC Overhead for Big Data Analytics

2. Spark Performance Optimization Analysis In Memory Management with Deploy Mode In Standalone Cluster Computing

3. Aggarwal, C., Subbian, K., Butler, K., Stephens, M., Stephens, M., Chakrabarti, D., Kumar, R., Tomkins, A., Clauset, A., Moore, C., Newman, M. E. J., Csardi, G., Nepusz, T., Decelle, A., Krzakala, F., Moore, C., Zdeborov, L., Eisinga, R., Te Grotenhuis, M., … Cov, E. R. (2014). SNAP Datasets: Stanford Large Network Dataset Collection. Physical Review Letters, Complex Sy, (1).

4. Performance Characterization of Spark Workloads on Shared NUMA Systems