1. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL, pp. 4171–4186 (2019)
2. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
3. Li, L., et al.: From mimicking to integrating: knowledge integration for pre-trained language models. In: EMNLP, pp. 6391–6402 (2022)
4. Li, Z., Xu, X., Shen, T., Xu, C., Gu, J.C., Tao, C.: Leveraging large language models for NLG evaluation: a survey. arXiv preprint arXiv:2401.07103 (2024)
5. Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)