Affiliation:
1. École Polytechnique Fédérale de Lausanne
2. École Polytechnique Fédérale de Lausanne and RAW Labs SA
Abstract
As data continues to be generated at exponentially growing rates in heterogeneous formats, fast analytics to extract meaningful information is becoming increasingly important. Systems widely use in-memory caching as one of their primary techniques to speed up data analytics. However, caches in data analytics systems cannot rely on simple caching policies and a fixed data layout to achieve good performance. Different datasets and workloads require different layouts and policies to achieve optimal performance.
This paper presents ReCache, a cache-based performance accelerator that is reactive to the cost and heterogeneity of diverse raw data formats. Using timing measurements of caching operations and selection operators in a query plan, ReCache accounts for the widely varying costs of reading, parsing, and caching data in nested and tabular formats. Combining these measurements with information about frequently accessed data fields in the workload, ReCache automatically decides whether a nested or relational column-oriented layout would lead to better query performance. Furthermore, ReCache keeps track of commonly utilized operators to make informed cache admission and eviction decisions. Experiments on synthetic and real-world datasets show that our caching techniques decrease caching overhead for individual queries by an average of 59%. Furthermore, over the entire workload, ReCache reduces execution time by 19-75% compared to existing techniques.
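The cost-aware admission and eviction idea described above can be illustrated with a minimal sketch. This is not ReCache's actual algorithm: the class names (`Entry`, `CostAwareCache`), the method names, and the benefit formula (measured build time multiplied by access count, divided by size) are all assumptions made for illustration. The sketch covers only eviction driven by measured read-and-parse cost and does not model layout selection.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class Entry:
    value: Any
    size: int           # size in any consistent unit (e.g. bytes)
    build_cost: float   # seconds spent reading and parsing the raw data
    hits: int = 1       # the access that created the entry counts as one

    def benefit(self) -> float:
        # Assumed metric: time saved per unit of cache space.
        return self.build_cost * self.hits / self.size


class CostAwareCache:
    """Hypothetical cache that evicts the entry with the lowest benefit."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.used = 0
        self.entries: Dict[str, Entry] = {}

    def put(self, key: str, value: Any, size: int, build_cost: float) -> None:
        if key in self.entries:
            self.used -= self.entries.pop(key).size
        if size > self.capacity:
            return  # never admit an item that cannot fit at all
        # Evict the least beneficial entries until the new one fits.
        while self.used + size > self.capacity:
            victim = min(self.entries, key=lambda k: self.entries[k].benefit())
            self.used -= self.entries.pop(victim).size
        self.entries[key] = Entry(value, size, build_cost)
        self.used += size

    def get_or_load(self, key: str, load: Callable[[], Any], size: int) -> Any:
        entry = self.entries.get(key)
        if entry is not None:
            entry.hits += 1
            return entry.value
        # Time the load so that expensive-to-parse data is kept longer.
        start = time.perf_counter()
        value = load()
        self.put(key, value, size, time.perf_counter() - start)
        return value
```

Under this assumed metric, a small item that was slow to parse (for example, a scan over nested JSON) outlives a larger item that is cheap to rebuild (for example, a scan over a flat CSV file), which a plain least-recently-used policy would not guarantee.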
Subject
General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development
Cited by
15 articles.
1. GIO: Generating Efficient Matrix and Frame Readers for Custom Data Formats by Example;Proceedings of the ACM on Management of Data;2023-06-13
2. MUAR: Maximizing Utilization of Available Resources for Query Processing;2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing Workshops (CCGridW);2023-05
3. Metadata Caching in Presto: Towards Fast Data Processing;2022 IEEE International Conference on Big Data (Big Data);2022-12-17
4. JSON Tiles: Fast Analytics on Semi-Structured Data;Proceedings of the 2021 International Conference on Management of Data;2021-06-09
5. Efficient streaming subgraph isomorphism with graph neural networks;Proceedings of the VLDB Endowment;2021-01