Mathematical Formulation of Learning and Its Computational Complexity for Transformers’ Layers-Reference-Cited by-同舟云学术

Mathematical Formulation of Learning and Its Computational Complexity for Transformers’ Layers

Published:2023-12-21 Issue:1 Volume:5 Page:34-50
ISSN:2673-4117
Container-title:Eng
language:en
Short-container-title:Eng

Author:

Pau Danilo Pietro¹^ORCID,Aymone Fabrizio Maria¹^ORCID

Affiliation:

1. Department of Systems Research and Applications, STMicroelectronics, 20864 Agrate Brianza, Italy

Abstract

Transformers are the cornerstone of natural language processing and other much more complicated sequential modelling tasks. The training of these models, however, requires an enormous number of computations, with substantial economic and environmental impacts. An accurate estimation of the computational complexity of training would allow us to be aware in advance about the associated latency and energy consumption. Furthermore, with the advent of forward learning workloads, an estimation of the computational complexity of such neural network topologies is required in order to reliably compare backpropagation with these advanced learning procedures. This work describes a mathematical approach, independent from the deployment on a specific target, for estimating the complexity of training a transformer model. Hence, the equations used during backpropagation and forward learning algorithms are derived for each layer and their complexity is expressed in the form of MACCs and FLOPs. By adding all of these together accordingly to their embodiment into a complete topology and the learning rule taken into account, the total complexity of the desired transformer workload can be estimated.

Publisher

MDPI AG

Subject

General Earth and Planetary Sciences

Link

https://www.mdpi.com/2673-4117/5/1/3/pdf

Reference33 articles.

1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and Polosukhin, I. (2017, January 4–9). Attention is All you Need. Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA.

2. Kaplan, J., McCandlish, S., Henighan, T., Brown, T.B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. (2020). Scaling Laws for Neural Language Models. arXiv.

3. Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., and Askell, A. (2020). Language Models are Few-Shot Learners. arXiv.

4. Mielke, S.J., Alyafeai, Z., Salesky, E., Raffel, C., Dey, M., Gallé, M., Raja, A., Si, C., Lee, W.Y., and Sagot, B. (2021). Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP. arXiv.

5. Maslej, N., Fattorini, L., Brynjolfsson, E., Etchemendy, J., Ligett, K., Lyons, T., Manyika, J., Ngo, H., Niebles, J.C., and Parli, V. (2023). The AI Index 2023 Annual Report, AI Index Steering Committee, Institute for Human-Centered AI, Stanford University. Technical report.

Cited by 2 articles. 订阅此论文施引文献订阅此论文施引文献，注册后可以免费订阅5篇论文的施引文献，订阅后可以查看论文全部施引文献

1. A hybrid approach of simultaneous segmentation and classification for medical image analysis;Multimedia Tools and Applications;2024-05-16

2. Forward Learning of Large Language Models by Consumer Devices;Electronics;2024-01-18