Author:
Wei Zhanchen, Huang Qiulan, Sun Gongxing, Liu Xiaoyu
Abstract
The traditional partial wave analysis (PWA) algorithm processes data serially and requires a large amount of memory to store runtime data, which may exceed the memory capacity of a single node. Parallelizing the algorithm in a distributed data computing framework is therefore necessary to improve its performance. Within an existing production-level Hadoop cluster, we implement the PWA algorithm on top of Spark to process data stored in the underlying storage system, HDFS. In this setup, however, sharing data through HDFS or through Spark's internal data communication mechanism is extremely inefficient. To solve this problem, this paper presents an in-memory parallel computing method for the PWA algorithm. With this system, runtime data can be shared easily among parallel algorithms. Owing to the data management mechanism of Alluxio, we can ensure complete data locality, maintain compatibility with the traditional data input/output approach, and cache most of the repeatedly used data in memory to improve performance.
Subject
General Physics and Astronomy