1. M. Chen, J. Tworek, H. Jun, Q. Yuan, H. Ponde, J. Kaplan, H. Edwards, Y. Burda, N. Joseph and G. Brockman, et al., Evaluating large language models trained on code, arXiv:2107.03374, 2021
2. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser and I. Polosukhin, Attention is all you need, in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008
3. J. Devlin, M.-W. Chang, K. Lee and K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv:1810.04805, 2018
4. More generally, “tokens” are masked