dsJSON: A Distributed SQL JSON Processor

Author:

Saeedan Majid (1), Eldawy Ahmed (1), Zhao Zhijia (1)

Affiliation:

1. University of California, Riverside, Riverside, CA, USA

Abstract

The popularity of JSON as a data interchange format has produced a large number of datasets available for processing. Users would like to analyze this data with SQL queries, but existing distributed systems limit their users to two specific formats, JSONLine and GeoJSON. The complexity of JSON schemas makes it challenging to parse arbitrary files in a modern distributed system while producing records with a unified schema that can be processed with SQL. To address these challenges, this paper introduces dsJSON, a state-of-the-art distributed JSON processor that overcomes limitations in existing systems and scales to big and complex data. dsJSON introduces the projection tree, a novel data structure that applies selective parsing of nested attributes to produce records that are ready for SQL processors. The key objective of the projection tree is to parse a big JSON file in parallel while producing records with a unified schema that can be processed with SQL. dsJSON is integrated into SparkSQL, which enables users to run arbitrary SQL queries on complex JSON files. It also pushes projection and filter operations down into the parser for full integration between the parser and the processor. Experiments on up to two terabytes of real data show that dsJSON performs several times faster than existing systems. It can also efficiently parse extremely large files that are not supported by existing distributed parsers.
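The idea of selecting nested attributes and emitting records with one unified schema can be illustrated with a minimal Python sketch. This is not dsJSON's implementation: the function names `build_projection_tree` and `project` and the dotted-path notation are assumptions made for illustration. The sketch also works on records that are already fully parsed, so it shows only the projection and schema-unification step, not parallel parsing or skipping of unselected attributes inside the parser.

```python
import json


def build_projection_tree(paths):
    """Turn dotted paths such as 'user.name' into a nested dict (a trie).

    Hypothetical helper: an empty dict marks a leaf, i.e. an attribute
    whose value is kept whole.
    """
    tree = {}
    for path in paths:
        node = tree
        for key in path.split("."):
            node = node.setdefault(key, {})
    return tree


def project(value, tree):
    """Keep only the attributes named in the tree.

    Attributes that are absent from a record are filled with None, so
    every output record has the same shape.
    """
    if not tree:  # leaf of the projection tree: keep the value as-is
        return value
    source = value if isinstance(value, dict) else {}
    return {key: project(source.get(key), sub) for key, sub in tree.items()}


# Two records with different shapes, one JSON object per line.
lines = [
    '{"id": 1, "user": {"name": "a", "age": 3}, "tags": ["x"]}',
    '{"id": 2, "extra": true}',
]

tree = build_projection_tree(["id", "user.name"])
rows = [project(json.loads(line), tree) for line in lines]

for row in rows:
    print(row)
```

Both output rows carry the same two attributes, `id` and `user.name`, which is the property a SQL processor needs from its input. Unselected attributes such as `tags` and `extra` are dropped.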

Funder

National Science Foundation

Publisher

Association for Computing Machinery (ACM)

