Angel: a new large-scale machine learning system

Published:2017-02-24 Issue:2 Volume:5 Page:216-236
ISSN:2095-5138
Container-title:National Science Review
language:en

Author:

Jiang Jie¹²,Yu Lele¹,Jiang Jiawei¹,Liu Yuhong²,Cui Bin¹

Affiliation:

1. Key Lab of High Confidence Software Technologies (MOE), School of EECS, Peking University, Beijing 100871, China

2. Data Platform, Tencent Inc., Shenzhen 518057, China

Abstract

Machine Learning (ML) techniques are now ubiquitous tools for extracting structural information from data collections. With the increasing volume of data, large-scale ML applications require an efficient implementation to accelerate performance. Existing systems parallelize algorithms through either data parallelism or model parallelism. However, data parallelism cannot achieve good statistical efficiency because of conflicting updates to parameters, while the performance of model-parallel methods is damaged by global barriers. In this paper, we propose a new system, named Angel, to facilitate the development of large-scale ML applications in production environments. By allowing concurrent updates to the model across different groups and scheduling the updates within each group, Angel achieves a good balance between hardware efficiency and statistical efficiency. In addition, Angel reduces network latency by overlapping parameter pulling with update computation, and it exploits the sparseness of data to avoid pulling unnecessary parameters. We also enhance the usability of Angel by providing a set of efficient tools for integrating with application pipelines and by providing efficient fault-tolerance mechanisms. We conduct extensive experiments to demonstrate the superiority of Angel.

Funder

National Natural Science Foundation of China

National Basic Research Program of China

Shenzhen Government Research Project

Publisher

Oxford University Press (OUP)

Subject

Multidisciplinary

Link

http://academic.oup.com/nsr/article-pdf/5/2/216/31567304/nwx018.pdf

Reference: 37 articles.

1. Tencentrec: Real-time stream recommendation in practice;Huang;Proceedings of SIGMOD Conference 2015,2015

2. Real-time video recommendation exploration;Huang;Proceedings of SIGMOD Conference 2016,2016

3. Spark: cluster computing with working sets;Zaharia;Proceedings of HotCloud 2010,2010

4. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing;Zaharia;Proceedings of NSDI Conference 2012,2012

5. Petuum: a new platform for distributed machine learning on big data;Xing;IEEE Trans Big Data,2015

Cited by 52 articles.

1. A novel device placement approach based on position-aware subgraph neural networks;Neurocomputing;2024-05

2. Topologies in distributed machine learning: Comprehensive survey, recommendations and future directions;Neurocomputing;2024-01

3. Scaling Machine Learning with a Ring-based Distributed Framework;Proceedings of the 2023 7th International Conference on Computer Science and Artificial Intelligence;2023-12-08

4. A systematic evaluation of machine learning on serverless infrastructure;The VLDB Journal;2023-09-20

5. Machine Learning Method with Applications in Hardware Security of Post-Quantum Cryptography;Journal of Grid Computing;2023-03-20