Navigating the Landscape of Distributed Computing Frameworks for Machine and Deep Learning

Published: 2023-06-02 Pages: 1-25
ISSN: 2327-0411
Container-title: Scalable and Distributed Machine Learning and Deep Learning Patterns

Author:

Ramasamy Mekala¹, T Agila Harshini², Elangovan Mohanraj³

Affiliation:

1. Bannari Amman Institute of Technology, India

2. Vellore Institute of Technology, Chennai, India

3. K.S. Rangasamy College of Technology, India

Abstract

Distributed computing is crucial to machine learning and deep learning models for several reasons. First, it makes it possible to train large models that do not fit in a single machine's memory. Second, by distributing the workload over several machines, it speeds up training. Third, it enables the management of vast amounts of data that may be dispersed across multiple devices or stored remotely. Distributed computing also improves fault tolerance, because the system can continue processing data even if one machine fails. This chapter summarizes the major frameworks (TensorFlow, PyTorch, Apache Spark, Hadoop, and Horovod) that enable developers to design and implement distributed computing models on large datasets. The challenges faced by distributed computing models include communication overhead, fault tolerance, load balancing, scalability, and security; solutions are proposed to overcome these challenges.

Publisher

IGI Global

References (27 articles)

1. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., & Zheng, X. (2016). TensorFlow: A System for Large-Scale Machine Learning. Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI '16).

2. BBBC-DDRL: A hybrid big-bang big-crunch optimization and deliberated deep reinforced learning mechanisms for cyber-attack detection

3. Cai, Z. (2023). Communication-Efficient Distributed Stochastic Gradient Descent with Pooling Operator. SSRN.

4. Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., & Zhang, Z. (2015). MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems. Proceedings of the 2015 ACM Symposium on Cloud Computing (SoCC '15), 1-13.

5. Dai, J., Ding, D., Shi, D., Huang, S., Wang, J., Qiu, X., Huang, K., Song, G., Wang, Y., Gong, Q., Song, J., Yu, S., Zheng, L., Chen, Y., Deng, J., & Song, G. (2022). BigDL 2.0: Seamless Scaling of AI Pipelines from Laptops to Distributed Cluster. arXiv:2204.01715.