Affiliation:
1. Microsoft Research, Redmond
2. University of Maryland, College Park
Abstract
Analytics over the increasing quantity of data stored in the Cloud has become very expensive, particularly due to the pay-as-you-go Cloud computation model. Data scientists typically manually extract samples of increasing data size (progressive samples) using domain-specific sampling strategies for exploratory querying. This provides them with user control, repeatable semantics, and result provenance. However, such solutions result in tedious workflows that preclude the reuse of work across samples. On the other hand, existing approximate query processing systems report early results, but do not offer the above benefits for complex ad-hoc queries. We propose a new progressive analytics system based on a progress model called Prism that (1) allows users to communicate progressive samples to the system; (2) allows efficient and deterministic query processing over samples; and (3) provides repeatable semantics and provenance to data scientists. We show that one can realize this model for atemporal relational queries using an unmodified temporal streaming engine, by re-interpreting temporal event fields to denote progress. Based on Prism, we build Now!, a progressive data-parallel computation framework for Windows Azure, where progress is understood as a first-class citizen in the framework. Now! works with "progress-aware reducers"; in particular, it works with streaming engines to support progressive SQL over big data. Extensive experiments on Windows Azure with real and synthetic workloads validate the scalability and benefits of Now! and its optimizations over current solutions for progressive analytics.
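The idea of re-interpreting a temporal field as progress can be illustrated with a minimal sketch. It is not the Prism or Now! API: the function name `progressive_counts` and the row layout are hypothetical, and it assumes each row carries an integer progress point from which it contributes to every later result. Processing rows in progress order, as a streaming engine would process events in time order, yields one early result per progress point, and the output depends only on the tags, not on arrival order.

```python
from collections import defaultdict


def progressive_counts(rows):
    """Return one cumulative GROUP BY count per progress point.

    rows: iterable of (progress_point, key) pairs. The progress point plays
    the role a timestamp would play in a temporal streaming engine
    (an assumption made for this illustration).
    """
    # Bucket rows by the progress point assigned by the user's sampling strategy.
    by_point = defaultdict(list)
    for point, key in rows:
        by_point[point].append(key)

    counts = defaultdict(int)
    results = []
    # Visit progress points in increasing order, reusing work from earlier samples.
    for point in sorted(by_point):
        for key in by_point[point]:
            counts[key] += 1
        # Snapshot the running aggregate: this is the early result at `point`.
        results.append((point, dict(counts)))
    return results


if __name__ == "__main__":
    sample = [(0, "a"), (1, "b"), (0, "b"), (2, "a")]
    for point, result in progressive_counts(sample):
        print(point, result)
```

Because every result is tied to a progress point, rerunning the query over the same tagged rows reproduces the same sequence of early results, which is the repeatability property the abstract describes.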
Subject
General Earth and Planetary Sciences,Water Science and Technology,Geography, Planning and Development
Cited by
31 articles.
1. Less is More: How Fewer Results Improve Progressive Join Query Processing;35th International Conference on Scientific and Statistical Database Management;2023-07-10
2. SynopsisDB: Distributed Synopsis-based Data Processing System;Companion of the 2023 International Conference on Management of Data;2023-06-04
3. Efficient Sampling for Big Provenance;Companion Proceedings of the ACM Web Conference 2023;2023-04-30
4. Tempura: a general cost-based optimizer framework for incremental data processing (Journal Version);The VLDB Journal;2023-03-20
5. Controlled Intentional Degradation in Analytical Video Systems;Proceedings of the 2022 International Conference on Management of Data;2022-06-10