INSTalytics

Published:2020-02-05 Issue:4 Volume:15 Page:1-30
ISSN:1553-3077
Container-title:ACM Transactions on Storage
language:en
Short-container-title:ACM Trans. Storage

Author:

Sivathanu Muthian¹,Vuppalapati Midhul¹,Gulavani Bhargav S.¹,Rajan Kaushik¹,Leeka Jyoti¹,Mohan Jayashree²,Kedia Piyus³

Affiliation:

1. Microsoft Research India, Bangalore, Karnataka

2. University of Texas-Austin, Austin, Texas

3. Indraprastha Institute of Information Technology Delhi, New Delhi, Delhi

Abstract

We present the design, implementation, and evaluation of INSTalytics, a co-designed stack of a cluster file system and the compute layer, for efficient big-data analytics in large-scale data centers. INSTalytics amplifies the well-known benefits of data partitioning in analytics systems; instead of traditional partitioning on one dimension, INSTalytics enables data to be simultaneously partitioned on four different dimensions at the same storage cost, enabling a larger fraction of queries to benefit from partition filtering and joins without network shuffle. To achieve this, INSTalytics uses compute-awareness to customize the three-way replication that the cluster file system employs for availability. A new heterogeneous replication layout enables INSTalytics to preserve the same recovery cost and availability as traditional replication. INSTalytics also uses compute-awareness to expose a new sliced-read API that improves performance of joins by enabling multiple compute nodes to read slices of a data block efficiently via co-ordinated request scheduling and selective caching at the storage nodes. We have built a prototype implementation of INSTalytics in a production analytics stack, and we show that recovery performance and availability are similar to physical replication, while providing significant improvements in query performance, suggesting a new approach to designing cloud-scale big-data analytics systems.

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture

Link

https://dl.acm.org/doi/pdf/10.1145/3369738

Reference: 35 articles.

1. AMPLab. [n.d.]. AMP big-data benchmark. Retrieved from https://amplab.cs.berkeley.edu/benchmark/.

2. Spark SQL

3. Rock you like a hurricane

4. EVENODD: an efficient scheme for tolerating double disk failures in RAID architectures

Cited by 6 articles.

1. Unshackling Database Benchmarking from Synthetic Workloads;2023 IEEE 39th International Conference on Data Engineering (ICDE);2023-04

2. Towards Optimizing Storage Costs on the Cloud;2023 IEEE 39th International Conference on Data Engineering (ICDE);2023-04

3. Remus: Efficient Live Migration for Distributed Databases with Snapshot Isolation;Proceedings of the 2022 International Conference on Management of Data;2022-06-10

4. Replicated layout for in-memory database systems;Proceedings of the VLDB Endowment;2021-12

5. The cosmos big data platform at Microsoft;Proceedings of the VLDB Endowment;2021-07