Columnar formats for schemaless LSM-based document stores

Authors:

Alkowaileet Wail Y. (1), Carey Michael J. (1)

Affiliation:

1. University of California

Abstract

In the last decade, document store database systems have gained traction for storing and querying large volumes of semi-structured data. However, the flexibility of the document stores' data models has limited their ability to store data in a column-major layout, making them less performant for analytical workloads than column-store relational databases. In this paper, we propose several techniques, tailored to document stores and based on piggy-backing on Log-Structured Merge (LSM) tree events, to store data in a columnar layout. We first extend the Dremel format, a popular on-disk columnar format for semi-structured data, to comply with document stores' flexible data model. We then introduce a new columnar layout for organizing and storing data in LSM-based storage. We also highlight the potential of using query compilation techniques for document stores, where values' types are known only at runtime. We have implemented and evaluated our techniques to measure their impact on storage, data ingestion, and query performance in Apache AsterixDB. Our experiments show significant performance gains, improving the query execution time by orders of magnitude while minimally impacting ingestion performance.

Publisher

Association for Computing Machinery (ACM)

Subject

General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development

Cited by 1 article.

1. An Empirical Evaluation of Columnar Storage Formats; Proceedings of the VLDB Endowment; 2023-10
