Heterogeneous Distributed Big Data Clustering on Sparse Grids

Published: 2019-03-07 Issue: 3 Volume: 12 Page: 60
ISSN: 1999-4893
Container-title: Algorithms
Language: en
Short-container-title: Algorithms

Author:

Pfander David, Daiß Gregor, Pflüger Dirk

Abstract

Clustering is an important task in data mining that has become more challenging due to the ever-increasing size of available datasets. To cope with these big data scenarios, a high-performance clustering approach is required. Sparse grid clustering is a density-based clustering method that uses a sparse grid density estimation as its central building block. The underlying density estimation approach enables the detection of clusters with non-convex shapes and without a predetermined number of clusters. In this work, we introduce a new distributed and performance-portable variant of the sparse grid clustering algorithm that is suited for big data settings. Our compute kernels were implemented in OpenCL to enable portability across a wide range of architectures. For distributed environments, we added a manager–worker scheme that was implemented using MPI. In experiments on two supercomputers, Piz Daint and Hazel Hen, with up to 100 million data points in a ten-dimensional dataset, we show the performance and scalability of our approach. The dataset with 100 million data points was clustered in 1198 s using 128 nodes of Piz Daint. This translates to an overall performance of 352 TFLOPS. On the node-level, we provide results for two GPUs, Nvidia’s Tesla P100 and the AMD FirePro W8100, and one processor-based platform that uses Intel Xeon E5-2680v3 processors. In these experiments, we achieved between 43% and 66% of the peak performance across all compute kernels and devices, demonstrating the performance portability of our approach.
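To illustrate the general idea of density-based clustering described in the abstract, the following is a minimal single-machine sketch in Python. It is not the authors' implementation: a simple unnormalized Gaussian kernel sum stands in for the sparse grid density estimation, and the graph-based steps (nearest-neighbor graph, removal of edges through low-density regions, connected components) are assumptions about how such a method might proceed. All names (`make_density`, `knn_graph`, `cluster`) and parameters (`k`, `bandwidth`, `threshold`) are hypothetical.

```python
import math
from collections import defaultdict


def make_density(points, bandwidth):
    """Return a density function built from an unnormalized Gaussian kernel sum.

    Assumption: this replaces the sparse grid density estimate for brevity.
    """
    n = len(points)

    def density(x):
        total = 0.0
        for p in points:
            sq = sum((a - b) ** 2 for a, b in zip(x, p))
            total += math.exp(-sq / (2.0 * bandwidth ** 2))
        return total / n

    return density


def knn_graph(points, k):
    """Brute-force k-nearest-neighbor lists, O(n^2); fine for a toy example."""
    graph = {}
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        graph[i] = [j for _, j in dists[:k]]
    return graph


def cluster(points, k=3, bandwidth=0.5, threshold=0.2):
    """Label each point with a cluster id; -1 marks low-density points (noise)."""
    density = make_density(points, bandwidth)
    dens = [density(p) for p in points]

    # Keep an edge only if both endpoints and the edge midpoint are dense.
    adj = defaultdict(set)
    for i, neighbors in knn_graph(points, k).items():
        if dens[i] < threshold:
            continue
        for j in neighbors:
            if dens[j] < threshold:
                continue
            mid = tuple((a + b) / 2.0 for a, b in zip(points[i], points[j]))
            if density(mid) >= threshold:
                adj[i].add(j)
                adj[j].add(i)

    # Connected components of the pruned graph are the clusters.
    labels = [-1] * len(points)
    next_label = 0
    for start in range(len(points)):
        if labels[start] != -1 or dens[start] < threshold:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if labels[v] == -1:
                    labels[v] = next_label
                    stack.append(v)
        next_label += 1
    return labels


# Two small blobs and one isolated point; the number of clusters is not given.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
          (20.0, 20.0)]
print(cluster(points))
```

In this sketch the number of clusters falls out of the connected components rather than being supplied up front, and the isolated point is left unlabeled (-1) because its estimated density is below the threshold. The brute-force neighbor search and kernel sum are quadratic in the number of points, so this only conveys the idea and says nothing about the performance behavior reported in the abstract.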

Funder

Deutsche Forschungsgemeinschaft

Publisher

MDPI AG

Subject

Computational Mathematics, Computational Theory and Mathematics, Numerical Analysis, Theoretical Computer Science

Link

https://www.mdpi.com/1999-4893/12/3/60/pdf

Reference: 33 articles.

1. The Elements of Statistical Learning;Hastie,2009

2. An efficient k-means clustering algorithm: analysis and implementation

Cited by 4 articles.

1. Big data and human resource management: paving the way toward sustainability;European Journal of Innovation Management;2023-08-31

2. Application of Big Data Clustering Algorithm in Electrical Engineering Automation;Journal of Applied Mathematics;2022-11-22

3. Fast Sparse Grid Operations Using the Unidirectional Principle: A Generalized and Unified Framework;Lecture Notes in Computational Science and Engineering;2021

4. Resource-Aware Device Allocation of Data-Parallel Applications on Heterogeneous Systems;Electronics;2020-11-02