1. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping;dodge;ArXiv Preprint,2020
2. BERT: Pre-training of deep bidirectional transformers for language understanding;devlin;ArXiv Preprint,2018
3. Attention is all you need;vaswani;Advances in neural information processing systems,2017