Formal semantics and high performance in declarative machine learning using Datalog

Published: 2021-05-31 Issue: 5 Volume: 30 Page: 859-881
ISSN: 1066-8888
Container-title: The VLDB Journal
language: en
Short-container-title: The VLDB Journal

Author:

Wang Jin, Wu Jiacheng, Li Mingda, Gu Jiaqi, Das Ariyam, Zaniolo Carlo

Abstract

With an escalating arms race to adopt machine learning (ML) in diverse application domains, there is an urgent need to support declarative machine learning over distributed data platforms. Toward this goal, a new framework is needed where users can specify ML tasks in a manner where programming is decoupled from the underlying algorithmic and system concerns. In this paper, we argue that declarative abstractions based on Datalog are natural fits for machine learning and propose a purely declarative ML framework with a Datalog query interface. We show that using aggregates in recursive Datalog programs entails a concise expression of ML applications, while providing a strictly declarative formal semantics. This is achieved by introducing simple conditions under which the semantics of recursive programs is guaranteed to be equivalent to that of aggregate-stratified ones. We further provide specialized compilation and planning techniques for semi-naive fixpoint computation in the presence of aggregates and optimization strategies that are effective on diverse recursive programs and distributed data platforms. To test and demonstrate these research advances, we have developed a powerful and user-friendly system on top of Apache Spark. Extensive evaluations on large-scale datasets illustrate that this approach will achieve promising performance gains while improving both programming flexibility and ease of development and deployment for ML applications.

Publisher

Springer Science and Business Media LLC

Subject

Hardware and Architecture, Information Systems

Link

https://link.springer.com/content/pdf/10.1007/s00778-021-00665-6.pdf

References: 79 articles.

1. Meng, X., Bradley, J.K., Yavuz, B., et al.: MLlib: machine learning in Apache Spark. J. Mach. Learn. Res. 17, 34:1–34:7 (2016)

2. Apache Mahout. https://mahout.apache.org/

3. Hellerstein, J.M., Ré, C., Schoppmann, F., Wang, D.Z., Fratkin, E., Gorajek, A., Ng, K.S., Welton, C., Feng, X., Li, K., Kumar, A.: The MADlib analytics library or MAD skills, the SQL. Proc. VLDB Endow. 5(12), 1700–1711 (2012)

4. Li, Y., Wang, J., Li, M., Das, A., Gu, J., Zaniolo, C.: Kddlog: Performance and scalability in knowledge discovery by declarative queries with aggregates. In: IEEE International Conference on Data Engineering (ICDE) (2021)

5. Bellomarini, L., Sallinger, E., Gottlob, G.: The Vadalog system: Datalog-based reasoning for knowledge graphs. Proc. VLDB Endow. 11(9), 975–987 (2018)

Cited by 5 articles.

1. Communication-Avoiding Recursive Aggregation;2023 IEEE International Conference on Cluster Computing (CLUSTER);2023-10-31

2. Provenance-based Explanations for Machine Learning (ML) Models;2023 IEEE 39th International Conference on Data Engineering Workshops (ICDEW);2023-04

3. Provenance-based Explanations for Machine Learning (ML) Models;I C DATA ENGIN WORKS;2023

4. Demonstration of LogicLib: An Expressive Multi-Language Interface over Scalable Datalog System;Proceedings of the 31st ACM International Conference on Information & Knowledge Management;2022-10-17

5. Optimizing Parallel Recursive Datalog Evaluation on Multicore Machines;Proceedings of the 2022 International Conference on Management of Data;2022-06-10