GEqO: ML-Accelerated Semantic Equivalence Detection

Authors:

Haynes Brandon (1), Alotaibi Rana (1), Pavlenko Anna (1), Leeka Jyoti (2), Jindal Alekh (3), Tian Yuanyuan (1)

Affiliations:

1. Microsoft Gray Systems Lab, Redmond, WA, USA

2. Microsoft, Redmond, WA, USA

3. SmartApps, Bellevue, WA, USA

Abstract

Large-scale analytics engines have become a core dependency for modern data-driven enterprises to derive business insights and drive actions. These engines run a large number of analytic jobs over huge volumes of data every day, and their workloads often contain overlapping computations across multiple jobs. Reusing common computation is crucial for efficient cluster resource utilization and for reducing job execution time. Detecting common computation is the first and key step in reducing this redundancy. However, detecting equivalence on large-scale analytics engines requires efficient, scalable, and fully automated solutions. In addition, to maximize computation reuse, equivalence must be detected at the semantic level rather than only the syntactic level (i.e., queries that look different but compute the same result must be recognized as equivalent). Unfortunately, existing solutions fall short of these requirements. In this paper, we take a major step towards filling this gap by proposing GEqO, a portable and lightweight machine-learning-based framework for efficiently identifying semantically equivalent computations at scale. GEqO introduces two machine-learning-based filters that quickly prune nonequivalent subexpressions, and it employs a semi-supervised learning feedback loop that iteratively improves its model using an intelligent sampling mechanism. Further, with its novel database-agnostic featurization method, GEqO can transfer what it learns from one workload and database to another. Our extensive empirical evaluation shows that, on TPC-DS-like queries, GEqO yields significant performance gains (up to 200x faster than automated verifiers) and finds up to 2x more equivalences than optimizer-based and signature-based equivalence detection approaches.
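The filter-then-verify idea described in the abstract can be illustrated with a minimal Python sketch. It is not GEqO's implementation: the abstract does not specify how the filters or the verifier work, so every name here (`find_equivalent_pairs`, `same_tables`, `predicate_overlap`, `toy_verifier`), the toy query representation, and the 0.5 threshold are hypothetical stand-ins. The sketch assumes that pairs surviving the filters are handed to a more expensive verifier, and it only shows why pruning early reduces the number of verifier calls.

```python
from itertools import combinations
from typing import Callable, FrozenSet, List, Sequence, Tuple

# Toy subexpression: (set of table names, set of filter conjuncts).
# This representation is an assumption made for illustration only.
Expr = Tuple[FrozenSet[str], FrozenSet[str]]
PairCheck = Callable[[Expr, Expr], bool]


def find_equivalent_pairs(
    exprs: Sequence[Expr],
    filters: Sequence[PairCheck],
    verifier: PairCheck,
) -> Tuple[List[Tuple[Expr, Expr]], int]:
    """Apply cheap filters first; call the costly verifier only on survivors."""
    equivalent = []
    verifier_calls = 0
    for a, b in combinations(exprs, 2):
        # A pair is dropped as soon as any filter rejects it.
        if not all(keep(a, b) for keep in filters):
            continue
        verifier_calls += 1
        if verifier(a, b):
            equivalent.append((a, b))
    return equivalent, verifier_calls


def same_tables(a: Expr, b: Expr) -> bool:
    # Stand-in for a first filter: different inputs cannot be equivalent here.
    return a[0] == b[0]


def predicate_overlap(a: Expr, b: Expr) -> bool:
    # Stand-in for a learned scoring model: Jaccard overlap of conjuncts.
    union = a[1] | b[1]
    score = len(a[1] & b[1]) / len(union) if union else 1.0
    return score >= 0.5  # hypothetical threshold


def toy_verifier(a: Expr, b: Expr) -> bool:
    # Stand-in for an automated verifier; real verifiers reason about semantics.
    return a == b


q1 = (frozenset({"store_sales", "item"}), frozenset({"price > 10", "year = 2000"}))
q2 = (frozenset({"item", "store_sales"}), frozenset({"year = 2000", "price > 10"}))
q3 = (frozenset({"store_sales", "item"}), frozenset({"price > 10", "year = 2001"}))
q4 = (frozenset({"customer"}), frozenset({"age > 30"}))

pairs, calls = find_equivalent_pairs(
    [q1, q2, q3, q4], [same_tables, predicate_overlap], toy_verifier
)
print(f"{len(pairs)} equivalent pair(s) found with {calls} verifier call(s)")
```

In this toy workload there are six candidate pairs, and the filters reject all but one before the verifier runs. Filters of this kind are only useful if they rarely reject truly equivalent pairs, which is the trade-off a learned filter would have to be evaluated on.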

Publisher

Association for Computing Machinery (ACM)

