Affiliation:
1. Imperial College London
2. Neo4j
Abstract
The separation of data and code/queries has served Database Management Systems (DBMSs) well for decades. However, while the resulting soundness and rigidity are the basis for many performance-oriented optimizations, this separation lacks the flexibility to efficiently support modern data science applications such as data cleansing, data ingestion/augmentation, or generative models. To support such applications without sacrificing performance, we propose a new logical data model called Homoiconic Collection Processing (HCP). HCP is based on a well-known metaprogramming concept called homoiconicity (a unified representation for code and data).
In a DBMS, HCP supports the storage of "classic" relational data but also allows the storage and evaluation of code fragments we refer to as "Homoiconic Expressions". Homoiconic Expressions enable applications such as data imputation directly in the database kernel.
Implemented naïvely, such flexibility would come at a prohibitive cost in terms of performance. To make HCP performance-competitive with highly-tuned in-memory DBMSs, we develop a novel storage and processing model called Shape-Wise Microbatching (SWM) and implement it in a system called BOSS. BOSS is performance-competitive with high-performance DBMSs while offering unprecedented extensibility. To demonstrate the extensibility, we implement an extension for impute-and-query workloads: BOSS outperforms state-of-the-art homoiconic runtimes and data imputation systems by two to five orders of magnitude.
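To make the idea of storing code next to data concrete, the following is a minimal Python sketch of a homoiconic expression used for imputation. It is an illustration only and not BOSS's representation or API: the tuple encoding, the operator names "Column", "Mean" and "Plus", and the functions evaluate and materialize are all hypothetical choices made for this example. A table cell holds either a plain value or an expression, and a missing value is filled in by evaluating the expression stored in its place.

```python
from statistics import mean

# Hypothetical encoding (an assumption of this sketch): an expression is a
# tuple whose first element names an operator and whose remaining elements
# are arguments, which may be plain values or nested expressions.
def evaluate(expr, table):
    """Evaluate an expression against a table given as {column_name: list}."""
    if not isinstance(expr, tuple):
        return expr  # plain data evaluates to itself
    head, *args = expr
    if head == "Column":
        # Resolve a column reference, skipping cells that still hold expressions.
        return [v for v in table[args[0]] if not isinstance(v, tuple)]
    values = [evaluate(a, table) for a in args]
    if head == "Mean":
        return mean(values[0])
    if head == "Plus":
        return sum(values)
    raise ValueError(f"unknown operator: {head}")

def materialize(table):
    """Replace every stored expression by its value; plain cells are unchanged."""
    return {name: [evaluate(cell, table) for cell in cells]
            for name, cells in table.items()}

# The second "age" cell is missing and is stored as code: "mean of column age".
table = {
    "name": ["ada", "bob", "cy"],
    "age": [30, ("Mean", ("Column", "age")), 40],
}
result = materialize(table)
print(result["age"])
```

Because data and code share one representation, the same traversal handles both: plain cells pass through untouched, and only cells that hold expressions incur evaluation cost.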
Publisher
Association for Computing Machinery (ACM)