Fast Approximate Score Computation on Large-Scale Distributed Data for Learning Multinomial Bayesian Networks

Author:

Katib Anas 1, Rao Praveen 1, Barnard Kobus 2, Kamhoua Charles 3

Affiliation:

1. University of Missouri-Kansas City, MO

2. University of Arizona, Tucson, AZ

3. Army Research Lab, Adelphi, MD

Abstract

In this article, we focus on the problem of learning a Bayesian network over distributed data stored in a commodity cluster. Specifically, we address the challenge of computing the scoring function over distributed data efficiently and scalably, which is a fundamental task during learning. While exact scores can be computed using MapReduce-style computation, our goal is to compute approximate scores much faster, with probabilistic error bounds. We propose a novel approach designed to achieve the following: (a) decentralized score computation using the principle of gossiping; (b) lower resource consumption via a probabilistic approach for maintaining scores using the properties of a Markov chain; and (c) effective distribution of tasks during score computation (on large datasets) by combining well-known hashing techniques. We analyze our approach theoretically in terms of the convergence speed of the statistics required for score computation, and in terms of memory and network bandwidth consumption. We also discuss how our approach can efficiently recompute scores when new data become available. We conducted a comprehensive evaluation of our approach and compared it with MapReduce-style computation using datasets of different characteristics on a 16-node cluster. When MapReduce-style computation provided exact statistics for score computation, it was nearly 10 times slower than our approach. Although MapReduce-style computation ran faster on randomly sampled datasets than on the entire datasets, it was less accurate than our approach. Our approach achieved high accuracy (below 6% average relative error) in estimating the statistics for approximate score computation on all the tested datasets. In conclusion, our approach provides a feasible tradeoff between computation time and accuracy for fast approximate score computation on large-scale distributed data.
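
As a rough illustration of the gossiping principle mentioned in the abstract, the sketch below uses the generic push-sum averaging protocol to let every node estimate the global average of per-node counts without a central coordinator. This is not the authors' algorithm: the function name `push_sum_average`, the round count, the seed, and the example counts are all assumptions made for illustration, and the cluster is simulated in a single process.

```python
import random


def push_sum_average(local_values, rounds=50, seed=0):
    """Simulate push-sum gossip; return each node's estimate of the global average.

    Hypothetical helper for illustration only (not from the article).
    """
    n = len(local_values)
    if n < 2:
        raise ValueError("gossip needs at least two nodes")
    rng = random.Random(seed)
    s = [float(v) for v in local_values]  # per-node running sums
    w = [1.0] * n                         # per-node running weights
    for _ in range(rounds):
        new_s = [0.0] * n
        new_w = [0.0] * n
        for i in range(n):
            # Pick a uniformly random peer other than node i.
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1
            # Keep half of the (sum, weight) pair and push the other half.
            half_s, half_w = s[i] / 2.0, w[i] / 2.0
            new_s[i] += half_s
            new_w[i] += half_w
            new_s[j] += half_s
            new_w[j] += half_w
        s, w = new_s, new_w
    # The ratio sum/weight at each node converges to the global average.
    return [s[i] / w[i] for i in range(n)]


# Example: 16 simulated nodes, each holding a local count for one statistic.
counts = [120, 80, 95, 105, 60, 140, 100, 100, 90, 110, 75, 125, 130, 70, 85, 115]
estimates = push_sum_average(counts)
# Multiplying a node's estimate by the number of nodes approximates the global count.
approx_totals = [e * len(counts) for e in estimates]
```

Because every node keeps half of its mass and pushes the other half, the total sum and total weight are conserved in each round, which is why the per-node ratios agree on the true average as rounds accumulate.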

Funder

National Science Foundation

King Abdullah Scholarship Program

U.S. Air Force Summer Faculty Fellowship Program and the University of Missouri Research Board

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science


Cited by 4 articles.

1. Efficient parameter learning for Bayesian Network classifiers following the Apache Spark Dataframes paradigm;Knowledge and Information Systems;2024-04-08

2. Learning high-dependence Bayesian network classifier with robust topology;Expert Systems with Applications;2024-04

3. DistriBayes: A Distributed Platform for Learning, Inference and Attribution on Large Scale Bayesian Network;Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining;2023-02-27

4. A Gossip-Based System for Fast Approximate Score Computation in Multinomial Bayesian Networks;2019 IEEE 35th International Conference on Data Engineering (ICDE);2019-04
