Distributed Bayesian posterior voting strategy for massive data

Published:2022 Issue:5 Volume:30 Page:1936-1953
ISSN:2688-1594
Container-title:Electronic Research Archive
Short-container-title:era

Author:

Li Xuerui¹, Kang Lican², Liu Yanyan¹, Wu Yuanshan³

Affiliation:

1. School of Mathematics and Statistics, Wuhan University, China

2. Center for Quantitative Medicine, Duke-NUS Medical School, Singapore

3. School of Statistics and Mathematics, Zhongnan University of Economics and Law, China

Abstract

The emergence of massive data has driven recent interest in developing statistical learning methods and large-scale algorithms for analysis on distributed platforms. One widely used statistical approach is split-and-conquer (SaC), which originally aggregated all local solutions through a simple average to reduce the computational burden caused by communication costs. Aiming at lower computation cost with acceptable accuracy, this paper extends SaC to Bayesian variable selection for ultrahigh-dimensional linear regression and builds BVSaC for aggregation. Suppose ultrahigh-dimensional data are stored in a distributed manner across multiple computing nodes, with each node holding a disjoint subset of the data. On each node, we perform variable selection and coefficient estimation through a hierarchical Bayes formulation. Then a weighted majority voting method, BVSaC, combines the local results to retain good performance. The proposed approach requires only a small portion of the computation cost on each local dataset and therefore eases the computational burden, especially in Bayesian computation, at a small cost in accuracy, which in turn increases the feasibility of analyzing extraordinarily large datasets. Simulations and a real-world example show that the proposed approach performs as well as the whole-sample hierarchical Bayes method in terms of the accuracy of variable selection and estimation.
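
The aggregation step described in the abstract can be illustrated with a minimal sketch. This is not the paper's BVSaC procedure: the abstract does not give the voting weights or the rule for combining coefficients, so the function name `weighted_majority_vote`, the use of local sample sizes as weights, the 0.5 threshold, and the averaging of estimates over the nodes that selected a variable are all assumptions made for illustration. The local hierarchical Bayes fits are taken as given inputs.

```python
from typing import List, Sequence, Tuple


def weighted_majority_vote(
    local_selected: Sequence[Sequence[int]],
    local_coefs: Sequence[Sequence[float]],
    weights: Sequence[float],
    threshold: float = 0.5,
) -> Tuple[List[int], List[float]]:
    """Combine per-node selections by a weighted majority vote (hypothetical rule).

    local_selected[k][j] is 1 if node k selected variable j, else 0.
    local_coefs[k][j] is node k's coefficient estimate for variable j.
    weights[k] is the assumed voting weight of node k (here: its sample size).
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    p = len(local_selected[0])
    selected: List[int] = []
    coefs: List[float] = []
    for j in range(p):
        # Weighted share of nodes that voted to keep variable j.
        vote = sum(w * s[j] for w, s in zip(weights, local_selected)) / total
        keep = vote >= threshold and vote > 0
        selected.append(1 if keep else 0)
        if keep:
            # Assumption: average estimates only over nodes that selected j.
            w_in = sum(w for w, s in zip(weights, local_selected) if s[j])
            est = sum(
                w * c[j]
                for w, s, c in zip(weights, local_selected, local_coefs)
                if s[j]
            ) / w_in
            coefs.append(est)
        else:
            coefs.append(0.0)
    return selected, coefs


# Toy inputs: three nodes, three candidate variables (made-up numbers).
local_selected = [[1, 0, 1], [1, 1, 0], [1, 0, 0]]
local_coefs = [[2.0, 0.0, 0.5], [1.0, 0.3, 0.0], [1.5, 0.0, 0.0]]
weights = [100, 100, 200]  # assumed local sample sizes
selected, coefs = weighted_majority_vote(local_selected, local_coefs, weights)
```

In this toy input only the first variable reaches a weighted vote share of at least one half, so it is the only one retained. Each node sends just its selection indicators and coefficient estimates, which is what keeps communication light in a scheme of this kind.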

Publisher

American Institute of Mathematical Sciences (AIMS)

Cited by 1 article.

1. An innovative approach of determining the sample data size for machine learning models: a case study on health and safety management for infrastructure workers, Electronic Research Archive, 2022.