1. Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. arXiv preprint arXiv:2403.02310 (2024).
2. Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245 (2023).
3. Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, et al. 2023. The Falcon series of open language models. arXiv preprint arXiv:2311.16867 (2023).
4. AMD. 2024. AMD matrix cores. https://rocm.blogs.amd.com/software-tools-optimization/matrix-cores/README.html
5. Nan Yang, Tao Ge, Liang Wang, Binxing Jiao, Daxin Jiang, Linjun Yang, Rangan Majumder, and Furu Wei. 2023. Inference with Reference: Lossless Acceleration of Large Language Models. arXiv preprint arXiv:2304.04487 (2023).