Affiliation:
1. University of California, Riverside
2. University of Verona, Verona VR, Italy
Abstract
This article explores the use of deep learning to choose an appropriate spatial partitioning technique for big data. The exponential increase in the volume of spatial datasets has resulted in the development of big spatial data frameworks. These systems need to partition the data across machines to scale out the computation. Unfortunately, there is currently no method to automatically choose an appropriate partitioning technique based on the input data distribution.
This article addresses this problem by using deep learning to train a model that captures the relationship between the data distribution and the quality of the partitioning techniques. We propose a solution that runs in two phases, training and application. The offline training phase generates synthetic data based on diverse distributions, partitions them using six different partitioning techniques, and measures their quality using four quality metrics. At the same time, it summarizes the datasets using a histogram and well-designed skewness measures. The data summaries and the quality metrics are then used to train a deep learning model. The second phase uses this model to predict the best partitioning technique for a new dataset that needs to be partitioned. We run an extensive experimental evaluation on big spatial data and experimentally show the applicability of the proposed technique. We show that the proposed model outperforms the baseline method in accuracy when choosing the best partitioning technique, by analyzing only the summary of the datasets.
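To make the summarization step concrete, the following is a minimal sketch of how a point dataset might be reduced to a fixed-size histogram before being passed to a model. It is an illustration under stated assumptions, not the authors' implementation: the function name `grid_histogram`, the 4 x 4 resolution, the unit-square domain, and the use of the peak cell fraction as a crude stand-in for a skewness measure are all hypothetical choices. The abstract does not specify the histogram layout or the skewness measures actually used.

```python
import random


def grid_histogram(points, bins=4):
    """Return a bins x bins grid holding the fraction of points per cell.

    Assumes a non-empty list of (x, y) pairs inside the unit square.
    """
    counts = [[0] * bins for _ in range(bins)]
    for x, y in points:
        # Map each coordinate to a cell index; clamp so 1.0 lands in the last cell.
        col = min(int(x * bins), bins - 1)
        row = min(int(y * bins), bins - 1)
        counts[row][col] += 1
    total = len(points)
    return [[c / total for c in row] for row in counts]


def clamp01(v):
    """Clamp a value to the range [0, 1]."""
    return max(0.0, min(1.0, v))


# Two synthetic datasets (hypothetical): one uniform, one clustered near the center.
random.seed(0)
uniform = [(random.random(), random.random()) for _ in range(10000)]
clustered = [
    (clamp01(random.gauss(0.5, 0.05)), clamp01(random.gauss(0.5, 0.05)))
    for _ in range(10000)
]

h_uniform = grid_histogram(uniform)
h_clustered = grid_histogram(clustered)

# Peak cell fraction: an illustrative indicator of skew, not a measure from the article.
peak_uniform = max(max(row) for row in h_uniform)
peak_clustered = max(max(row) for row in h_clustered)
```

In this sketch the clustered dataset concentrates its mass in a few cells, so its peak fraction is much higher than that of the uniform dataset. A summary of this kind has a fixed size regardless of the dataset size, which is what would allow a model to work from the summary alone.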
Funder
National Science Foundation
National Institute of Food and Agriculture
Publisher
Association for Computing Machinery (ACM)
Subject
Discrete Mathematics and Combinatorics, Geometry and Topology, Computer Science Applications, Modeling and Simulation, Information Systems, Signal Processing
Cited by
13 articles.
1. Advances in AI-assisted biochip technology for biomedicine;Biomedicine & Pharmacotherapy;2024-08
2. A Generic Machine Learning Model for Spatial Query Optimization based on Spatial Embeddings;ACM Transactions on Spatial Algorithms and Systems;2024-04-13
3. L/STIM: A Framework for Detecting Multi-Stage Cyber Attacks;2024 International Russian Smart Industry Conference (SmartIndustryCon);2024-03-25
4. A learning-based framework for spatial join processing: estimation, optimization and tuning;The VLDB Journal;2024-02-13
5. Learned Spatial Data Partitioning;Proceedings of the Sixth International Workshop on Exploiting Artificial Intelligence Techniques for Data Management;2023-06-18