Filter before you parse

Published:2018-07 Issue:11 Volume:11 Page:1576-1589
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
language:en
Short-container-title:Proc. VLDB Endow.

Author:

Palkar Shoumik¹, Abuzaid Firas¹, Bailis Peter¹, Zaharia Matei²

Affiliation:

1. Stanford InfoLab

2. Databricks Inc.

Abstract

Exploratory big data applications often run on raw unstructured or semi-structured data formats, such as JSON files or text logs. These applications can spend 80-90% of their execution time parsing the data. In this paper, we propose a new approach for reducing this overhead: apply filters on the data's raw bytestream before parsing. This technique, which we call raw filtering, leverages the features of modern hardware and the high selectivity of queries found in many exploratory applications. With raw filtering, a user-specified query predicate is compiled into a set of filtering primitives called raw filters (RFs). RFs are fast, SIMD-based operators that occasionally yield false positives, but never false negatives. We combine multiple RFs into an RF cascade to decrease the false positive rate and maximize parsing throughput. Because the best RF cascade is data-dependent, we propose an optimizer that dynamically selects the combination of RFs with the best expected throughput, achieving within 10% of the global optimum cascade while adding less than 1.2% overhead. We implement these techniques in a system called Sparser, which automatically manages a parsing cascade given a data stream in a supported format (e.g., JSON, Avro, Parquet) and a user query. We show that many real-world applications are highly selective and benefit from Sparser. Across diverse workloads, Sparser accelerates state-of-the-art parsers such as Mison by up to 22× and improves end-to-end application performance by up to 9×.
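The filter-then-parse idea in the abstract can be illustrated with a minimal Python sketch. This is not Sparser's implementation: Python's `bytes` substring test stands in for the SIMD-based raw filters, there is no optimizer, and the names `make_raw_filters` and `filter_then_parse` are hypothetical. It also assumes the predicate values are plain ASCII strings that appear unescaped in the raw JSON; otherwise a substring check could miss a true match.

```python
import json


def make_raw_filters(predicate):
    """Compile an equality predicate into byte needles (hypothetical helper).

    Assumes every value is an ASCII string stored unescaped in the raw record.
    """
    return [value.encode("ascii") for value in predicate.values()]


def filter_then_parse(raw_records, predicate):
    """Return (matching records, number of records actually parsed)."""
    needles = make_raw_filters(predicate)
    parsed_count = 0
    matches = []
    for raw in raw_records:
        # Raw filter cascade: cheap byte-level checks on the unparsed record.
        # A record failing any check cannot satisfy the predicate, so it is
        # skipped without being parsed.
        if not all(needle in raw for needle in needles):
            continue
        # Survivors may still be false positives (the needle can occur in an
        # unrelated field), so the full predicate is re-checked after parsing.
        parsed_count += 1
        record = json.loads(raw)
        if all(record.get(key) == value for key, value in predicate.items()):
            matches.append(record)
    return matches, parsed_count


records = [
    b'{"user": "alice", "lang": "en", "text": "hello"}',
    b'{"user": "bob", "lang": "fr", "text": "alice says hi"}',
    b'{"user": "carol", "lang": "en", "text": "nothing here"}',
    b'{"user": "alice", "lang": "fr", "text": "bonjour"}',
]
matches, parsed_count = filter_then_parse(records, {"user": "alice", "lang": "fr"})
```

In this toy input only two of the four records reach `json.loads`. The second record passes the raw filters because "alice" appears in its `text` field, and it is rejected only after parsing, which is the false-positive case the abstract describes.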

Publisher

VLDB Endowment

Subject

General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/3236187.3236207

Cited by 44 articles.

1. Lite2: A Schemaless Zero-Copy Serialization Format;Computers;2024-03-28

2. Validating CESU-8 Encoded Text Utilising SIMD Instructions;Proceedings of the 2024 13th International Conference on Software and Computer Applications;2024-02

3. On‐demand JSON: A better way to parse documents?;Software: Practice and Experience;2024-01-18

4. AS-Parser: Log Parsing Based on Adaptive Segmentation;Proceedings of the ACM on Management of Data;2023-12-08

5. TripleLP: Privacy-Preserving Log Parsing Based on Blockchain;2023 IEEE 14th International Symposium on Parallel Architectures, Algorithms and Programming (PAAP);2023-11-24