OliVe: Accelerating Large Language Models via Hardware-friendly Outlier-Victim Pair Quantization

Published: 2023-06-17
Container-title: Proceedings of the 50th Annual International Symposium on Computer Architecture

Author:

Guo Cong¹², Tang Jiaming¹², Hu Weiming¹², Leng Jingwen¹², Zhang Chen³, Yang Fan³, Liu Yunxin⁴⁵, Guo Minyi¹², Zhu Yuhao⁶

Affiliation:

1. Shanghai Jiao Tong University, Shanghai, China

2. Shanghai Qi Zhi Institute, Shanghai, China

3. Microsoft Research, Beijing, China

4. Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China

5. Shanghai Artificial Intelligence Laboratory, Shanghai, China

6. University of Rochester, Rochester, New York, USA

Publisher

ACM

References: 99 articles.

1. 2020. Nvidia ampere architecture whitepaper. https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf

2. Cnvlutin

3. Analyzing CUDA workloads using a detailed GPU simulator

4. Ron Banner, Yury Nahshan, and Daniel Soudry. 2019. Post training 4-bit quantization of convolutional networks for rapid-deployment. Advances in Neural Information Processing Systems 32 (2019).

5. Yoshua Bengio, Nicholas Léonard, and Aaron Courville. 2013. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432 (2013).

Cited by 13 articles.

1. Cambricon-D: Full-Network Differential Acceleration for Diffusion Models;2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA);2024-06-29

2. Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization;2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA);2024-06-29

3. Quantization and Hardware Architecture Co-Design for Matrix-Vector Multiplications of Large Language Models;IEEE Transactions on Circuits and Systems I: Regular Papers;2024-06

4. Scaling Down to Scale Up: A Cost-Benefit Analysis of Replacing OpenAI's LLM with Open Source SLMs in Production;2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS);2024-05-05

5. Towards Cognitive AI Systems: Workload and Characterization of Neuro-Symbolic AI;2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS);2024-05-05