Affiliation:
1. Microsoft Research India, Bangalore, Karnataka
2. University of Texas-Austin, Austin, Texas
3. Indraprastha Institute of Information Technology Delhi, New Delhi, Delhi
Abstract
We present the design, implementation, and evaluation of INSTalytics, a co-designed stack of a cluster file system and the compute layer, for efficient big-data analytics in large-scale data centers. INSTalytics amplifies the well-known benefits of data partitioning in analytics systems; instead of traditional partitioning on one dimension, INSTalytics enables data to be simultaneously partitioned on four different dimensions at the same storage cost, so that a larger fraction of queries can benefit from partition filtering and from joins without network shuffle. To achieve this, INSTalytics uses compute-awareness to customize the three-way replication that the cluster file system employs for availability. A new heterogeneous replication layout enables INSTalytics to preserve the same recovery cost and availability as traditional replication. INSTalytics also uses compute-awareness to expose a new sliced-read API that improves the performance of joins by enabling multiple compute nodes to read slices of a data block efficiently via co-ordinated request scheduling and selective caching at the storage nodes. We have built a prototype implementation of INSTalytics in a production analytics stack, and we show that recovery performance and availability are similar to those of physical replication, while query performance improves significantly, suggesting a new approach to designing cloud-scale big-data analytics systems.
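The idea of partitioning each replica on a different dimension can be illustrated with a small sketch. This is a toy model in Python, not the INSTalytics implementation: the row data, the column names (`user`, `region`, `day`), and the helper functions `partition`, `build_replicas`, and `scan` are all hypothetical, and it covers only one partitioning dimension per replica. It shows why a filter on a column benefits when some replica is partitioned on that column, while total storage stays at one copy of the data per replica.

```python
from collections import defaultdict

# Hypothetical toy rows; the column names are illustrative only.
ROWS = [
    {"user": "a", "region": "eu", "day": 1},
    {"user": "b", "region": "us", "day": 1},
    {"user": "a", "region": "us", "day": 2},
    {"user": "c", "region": "eu", "day": 2},
]


def partition(rows, column):
    """Group rows into partitions keyed by the value of `column`."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[column]].append(row)
    return dict(parts)


def build_replicas(rows, columns):
    """Build one logical replica per column; each replica holds every row once."""
    return {col: partition(rows, col) for col in columns}


def scan(replicas, column, value):
    """Answer an equality filter and return (matching rows, number of rows read)."""
    if column in replicas:
        # Partition filtering: read only one partition of the matching replica.
        hits = replicas[column].get(value, [])
        return hits, len(hits)
    # No replica is partitioned on this column: read one replica in full.
    any_replica = next(iter(replicas.values()))
    all_rows = [r for part in any_replica.values() for r in part]
    return [r for r in all_rows if r[column] == value], len(all_rows)


# Three replicas, each partitioned on a different column (an assumption of this sketch).
REPLICAS = build_replicas(ROWS, ["user", "region", "day"])
```

With all three replicas available, `scan(REPLICAS, "region", "eu")` reads only the rows in the `eu` partition, whereas the same filter against replicas partitioned only on `user` has to read every row.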
Publisher
Association for Computing Machinery (ACM)
Subject
Hardware and Architecture
Cited by
6 articles.
1. Unshackling Database Benchmarking from Synthetic Workloads;2023 IEEE 39th International Conference on Data Engineering (ICDE);2023-04
2. Towards Optimizing Storage Costs on the Cloud;2023 IEEE 39th International Conference on Data Engineering (ICDE);2023-04
3. Remus: Efficient Live Migration for Distributed Databases with Snapshot Isolation;Proceedings of the 2022 International Conference on Management of Data;2022-06-10
4. Replicated layout for in-memory database systems;Proceedings of the VLDB Endowment;2021-12
5. The cosmos big data platform at Microsoft;Proceedings of the VLDB Endowment;2021-07