Hardware-Efficient Data Imputation through DBMS Extensibility

Published: 2024-07 Issue: 11 Volume: 17 Page: 3497-3510
ISSN: 2150-8097
Container-title: Proceedings of the VLDB Endowment
Language: en
Short-container-title: Proc. VLDB Endow.

Author:

Mohr-Daurat Hubert¹, Theodorakis Georgios², Pirk Holger¹

Affiliation:

1. Imperial College London

2. Neo4j

Abstract

The separation of data and code/queries has served Data Management Systems (DBMSs) well for decades. However, while the resulting soundness and rigidity are the basis for many performance-oriented optimizations, it lacks the flexibility to efficiently support modern data science applications: data cleansing, data ingestion/augmentation or generative models. To support such applications without sacrificing performance, we propose a new logical data model called Homoiconic Collection Processing (HCP). HCP is based on a well-known Meta-Programming concept called Homoiconicity (a unified representation for code and data). In a DBMS, HCP supports the storage of "classic" relational data but also allows the storage and evaluation of code fragments we refer to as "Homoiconic Expressions". Homoiconic Expressions enable applications such as data imputation directly in the database kernel. Implemented naïvely, such flexibility would come at a prohibitive cost in terms of performance. To make HCP performance-competitive with highly-tuned in-memory DBMSs, we develop a novel storage and processing model called Shape-Wise Microbatching (SWM) and implement it in a system called BOSS. BOSS is performance-competitive with high-performance DBMSs while offering unprecedented extensibility. To demonstrate the extensibility, we implement an extension for impute-and-query workloads: BOSS outperforms state-of-the-art homoiconic runtimes and data imputation systems by two to five orders of magnitude.
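To make the idea of a Homoiconic Expression stored alongside relational data more concrete, the following is a minimal Python sketch. It is not BOSS's API or the authors' implementation; the `Expr` class, the `"Mean"` head, and the `evaluate` helper are hypothetical names introduced only for illustration. It shows a column in which a missing value is represented by a stored expression that is evaluated at query time to impute the value.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Expr:
    """Hypothetical homoiconic expression: a head symbol plus arguments."""
    head: str
    args: tuple


def evaluate(cell, table):
    """Return plain values unchanged; evaluate stored expressions on demand."""
    if not isinstance(cell, Expr):
        return cell
    if cell.head == "Mean":
        (column,) = cell.args
        # Impute from the values in the column that are plain data.
        known = [v for v in table[column] if not isinstance(v, Expr)]
        return mean(known)
    raise ValueError(f"unknown expression head: {cell.head}")


# A column mixing "classic" data with a code fragment standing in for a gap.
table = {"age": [30, Expr("Mean", ("age",)), 50]}
imputed = [evaluate(cell, table) for cell in table["age"]]
```

In this sketch the expression is just another cell value, so the same storage holds both data and code; a real system would need a far more efficient representation than per-cell Python objects, which is the performance problem the abstract says SWM addresses.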

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.14778/3681954.3682016
