Author:
Belcastro Loris, Giampà Salvatore, Marozzo Fabrizio, Talia Domenico, Trunfio Paolo, Badia Rosa M., Ejarque Jorge, Mammadli Nihad
Abstract
Developing and executing large-scale data analysis applications in parallel and distributed environments can be a complex and time-consuming task. Developers often find themselves diverted from their application logic to handle technical details about the underlying runtime and related issues. To simplify this process, ParSoDA, a Java library, has been proposed to facilitate the development of parallel data mining applications executed on HPC systems. It simplifies the process by providing built-in scalability mechanisms relying on the Hadoop and Spark frameworks. This paper presents ParSoDA-Py, the Python version of the ParSoDA library, which allows for further support of commonly used runtimes and libraries for big data analysis. After a complete library redesign, ParSoDA can now be easily integrated with other Python-based distributed runtimes for HPC systems, such as COMPSs and Apache Spark, and with the large ecosystem of Python-based data processing libraries. The paper discusses the adaptation process, which takes into consideration the new technical requirements, and evaluates both usability and scalability through some case study applications.
Funder
European Commission's Horizon 2020 Framework program and the European High-Performance Computing Joint Undertaking
National Centre for HPC, Big Data and Quantum Computing
FAIR – Future Artificial Intelligence Research
Spanish Government
Departament de Recerca i Universitats de la Generalitat de Catalunya
Università della Calabria
Publisher
Springer Science and Business Media LLC