ReCG: Bottom-up JSON Schema Discovery Using a Repetitive Cluster-and-Generalize Framework

Published:2024-07 Issue:11 Volume:17 Page:3538-3550
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
language:en
Short-container-title:Proc. VLDB Endow.

Author:

Yun Joohyung¹, Tak Byungchul², Han Wook-Shin³

Affiliation:

1. POSTECH, Pohang, Republic of Korea

2. Kyungpook National University, Daegu, Republic of Korea

3. Graduate School of AI, POSTECH, Republic of Korea

Abstract

Schemalessness, one of the major advantages of the JSON representation format, comes with high penalties in querying and operations because it denies critical functions such as query optimization, indexing, and data verification. There have been continuous efforts to develop an accurate algorithm for discovering a JSON schema from a bag of JSON documents. Unfortunately, existing schema discovery techniques, being top-down algorithms, face challenges from the lack of visibility into the child nodes of the JSON tree. In the absence of information about lower-level JSON elements, top-down algorithms need to employ assumptions and heuristics to decide the schema type of nodes. However, such static decisions are often violated in datasets, which causes top-down algorithms to perform poorly. To overcome this, we propose an algorithm, called ReCG, that processes JSON documents in a bottom-up manner. It builds up schemas from leaf elements upward in the JSON document tree and thus can make more informed decisions about the schema node types. In addition, we systematically adopt MDL (Minimum Description Length) principles while building up the schemas to choose, among candidate schemas, the most concise yet accurate one with well-balanced generality. Evaluations show that our technique improves the recall and precision of found schemas by as much as 47%, resulting in a 46% better F1 score, while also running 2.11× faster on average than the state of the art.
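The MDL idea mentioned in the abstract can be illustrated with a small sketch. This is not the ReCG algorithm and not the authors' cost model: the two candidate schema kinds ("record" with fixed optional keys versus "map" with arbitrary keys), the bit costs, and all function names are assumptions chosen only to show how a total description length (schema bits plus data-given-schema bits) can select between candidate schemas for the same bag of flat JSON objects.

```python
from typing import Any, Dict, List

BITS_PER_CHAR = 8  # assumed toy cost of spelling out one character of a key


def record_cost(docs: List[Dict[str, Any]]) -> int:
    """Description length if the node is a record with a fixed set of optional keys."""
    keys = set().union(*(d.keys() for d in docs))
    # Schema part: every distinct key is written once in the schema.
    schema_bits = sum(len(k) * BITS_PER_CHAR for k in keys)
    # Data part: each document needs one presence bit per schema key.
    data_bits = len(docs) * len(keys)
    return schema_bits + data_bits


def map_cost(docs: List[Dict[str, Any]]) -> int:
    """Description length if the node is a map from arbitrary keys to values."""
    # Schema part: a constant, since no key names are listed (assumed cost).
    schema_bits = BITS_PER_CHAR
    # Data part: every key in every document must be spelled out in full.
    data_bits = sum(len(k) * BITS_PER_CHAR for d in docs for k in d)
    return schema_bits + data_bits


def choose_schema(docs: List[Dict[str, Any]]) -> str:
    """Pick the candidate with the smaller total description length."""
    return "record" if record_cost(docs) <= map_cost(docs) else "map"


# Repeated keys: listing them once in the schema is cheaper.
users = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3}]
# Keys that behave like data (one distinct key per document): a map is cheaper.
counts = [{"u1001": 3}, {"u2002": 5}, {"u3003": 1}, {"u4004": 7}]

print(choose_schema(users), choose_schema(counts))
```

Value encoding is ignored here because it would cost the same under both candidates in this toy setting. A bottom-up method would make such a choice only after the child nodes have already been typed, which is the visibility that the abstract says top-down approaches lack.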

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.14778/3681954.3682019

References (47 articles)

1. 2024. Technical Report. Retrieved July 15, 2024 from https://sites.google.com/dblab.postech.ac.kr/recg-technical-report

2. Tarfah Alrashed, Jumana Almahmoud, Amy X. Zhang, and David R. Karger. 2020. ScrAPIr: Making Web Data APIs Accessible to End Users. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA) (CHI '20). ACM, New York, NY, USA, 1-12.

3. Validation of Modern JSON Schema: Formalization and Complexity

4. Mohamed Amine Baazizi, Houssem Ben Lahmar, Dario Colazzo, Giorgio Ghelli, and Carlo Sartiani. 2017. Schema Inference for Massive JSON Datasets. In Proceedings of the Conference on Extending Database Technology (EDBT). 222-233.

5. Schemas and Types for JSON Data