Affiliation:
1. Indian Institute of Science
Abstract
Synthesizing data using declarative formalisms has been persuasively advocated in contemporary data generation frameworks. In particular, they specify operator output volumes through row-cardinality constraints. However, thus far, adherence to these volumetric constraints has been limited to the Filter and Join operators. A critical deficiency is the lack of support for the Projection operator, which is at the core of basic SQL constructs such as Distinct, Union and Group By. The technical challenge here is that cardinality unions in multi-dimensional space, and not mere summations, need to be captured in the generation process. Further, dependencies across different data subspaces need to be taken into account.
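A minimal sketch may help show why projection cardinalities behave as unions rather than sums. The toy table, column positions, and filter regions below are hypothetical and are not taken from the paper. Row counts of disjoint filter regions add up, but the distinct counts of a projected column do not, because the regions can share projected values.

```python
# Hypothetical toy table: each row is an (a, b) pair.
rows = [(1, 10), (2, 10), (3, 20), (4, 20), (5, 30), (6, 30)]


def select(rows, predicate):
    """Return the rows that satisfy the filter predicate."""
    return [row for row in rows if predicate(row)]


def project_distinct(rows, column):
    """Return the set of distinct values found in the given column."""
    return {row[column] for row in rows}


# Two disjoint filter regions on column a (index 0).
region_1 = select(rows, lambda r: r[0] <= 3)
region_2 = select(rows, lambda r: r[0] >= 4)

# Row cardinalities of disjoint regions are additive.
row_count_sum = len(region_1) + len(region_2)

# Distinct cardinalities on column b (index 1) are not additive:
# both regions contain the value 20.
distinct_1 = project_distinct(region_1, column=1)
distinct_2 = project_distinct(region_2, column=1)
distinct_sum = len(distinct_1) + len(distinct_2)
distinct_union = len(distinct_1 | distinct_2)

print(row_count_sum, distinct_sum, distinct_union)
```

Here the summed distinct counts overestimate the true projection cardinality, so a generator that must satisfy projection constraints has to reason about how the projected value sets of different regions overlap.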
We address the above lacuna by presenting PiGen, a dynamic data generator that incorporates Projection cardinality constraints in its ambit. The design is based on a projection subspace division strategy that supports the expression of constraints using optimized linear programming formulations. Further, techniques of symmetric refinement and workload decomposition are introduced to handle constraints across different projection subspaces. Finally, PiGen supports dynamic generation, where data is generated on-demand during query processing, making it amenable to Big Data environments. A detailed evaluation on workloads derived from real-world and synthetic benchmarks demonstrates that PiGen can accurately and efficiently model Projection outcomes, representing an essential step forward in customized database generation.
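The idea of on-demand generation can be illustrated with a minimal sketch. It assumes a hypothetical summary format of (row template, repetition count) pairs; this is not PiGen's actual summary structure or interface, only an illustration of producing rows lazily instead of materializing a full table.

```python
from itertools import islice

# Hypothetical compact summary: (row template, number of copies to emit).
summary = [((10, "x"), 3), ((20, "y"), 2)]


def generate_rows(summary):
    """Lazily yield the rows described by the summary, one at a time."""
    for template, count in summary:
        for _ in range(count):
            yield template


# A consumer can pull only as many rows as it needs.
first_four = list(islice(generate_rows(summary), 4))

# The row count and the distinct count on the first column follow directly
# from the summary, without storing the generated table.
total_rows = sum(count for _, count in summary)
distinct_first_column = len({template[0] for template, _ in summary})

print(first_four, total_rows, distinct_first_column)
```

In this sketch the stored state is proportional to the number of summary entries rather than to the number of generated rows, which is the property that makes on-demand generation attractive at large scale.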
Publisher
Association for Computing Machinery (ACM)
Subject
General Earth and Planetary Sciences,Water Science and Technology,Geography, Planning and Development
Cited by
4 articles.
1. Lauca: A Workload Duplicator for Benchmarking Transactional Database Performance;IEEE Transactions on Knowledge and Data Engineering;2024-07
2. StreamBed: Capacity Planning for Stream Processing;Proceedings of the 18th ACM International Conference on Distributed and Event-based Systems;2024-06-24
3. Mirage: Generating Enormous Databases for Complex Workloads;2024 IEEE 40th International Conference on Data Engineering (ICDE);2024-05-13
4. Synthetic Data Generation for Enterprise DBMS;2023 IEEE 39th International Conference on Data Engineering (ICDE);2023-04