1. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding;Devlin,2018
2. Language Models are Few-Shot Learners;Brown,2020
3. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model;Wang,2021
4. Common Crawl blog, http://commoncrawl.org/connect/blog/ (Accessed: 2021-12-10).
5. The Pile: An 800GB Dataset of Diverse Text for Language Modeling;Gao,2020