Formal semantics and high performance in declarative machine learning using Datalog
Published:2021-05-31
Issue:5
Volume:30
Page:859-881
ISSN:1066-8888
Container-title:The VLDB Journal
language:en
Short-container-title:The VLDB Journal
Author:
Wang Jin, Wu Jiacheng, Li Mingda, Gu Jiaqi, Das Ariyam, Zaniolo Carlo
Abstract
With an escalating arms race to adopt machine learning (ML) in diverse application domains, there is an urgent need to support declarative machine learning over distributed data platforms. Toward this goal, a new framework is needed where users can specify ML tasks in a manner where programming is decoupled from the underlying algorithmic and system concerns. In this paper, we argue that declarative abstractions based on Datalog are natural fits for machine learning and propose a purely declarative ML framework with a Datalog query interface. We show that using aggregates in recursive Datalog programs entails a concise expression of ML applications, while providing a strictly declarative formal semantics. This is achieved by introducing simple conditions under which the semantics of recursive programs is guaranteed to be equivalent to that of aggregate-stratified ones. We further provide specialized compilation and planning techniques for semi-naive fixpoint computation in the presence of aggregates and optimization strategies that are effective on diverse recursive programs and distributed data platforms. To test and demonstrate these research advances, we have developed a powerful and user-friendly system on top of Apache Spark. Extensive evaluations on large-scale datasets illustrate that this approach will achieve promising performance gains while improving both programming flexibility and ease of development and deployment for ML applications.
Publisher
Springer Science and Business Media LLC
Subject
Hardware and Architecture, Information Systems
Cited by
5 articles.
1. Communication-Avoiding Recursive Aggregation; 2023 IEEE International Conference on Cluster Computing (CLUSTER); 2023-10-31
2. Provenance-based Explanations for Machine Learning (ML) Models; 2023 IEEE 39th International Conference on Data Engineering Workshops (ICDEW); 2023-04
3. Provenance-based Explanations for Machine Learning (ML) Models; I C DATA ENGIN WORKS; 2023
4. Demonstration of LogicLib: An Expressive Multi-Language Interface over Scalable Datalog System; Proceedings of the 31st ACM International Conference on Information & Knowledge Management; 2022-10-17
5. Optimizing Parallel Recursive Datalog Evaluation on Multicore Machines; Proceedings of the 2022 International Conference on Management of Data; 2022-06-10