1. BERT: Pre-training of deep bidirectional transformers for language understanding;Devlin;NAACL-HLT,2019
2. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter;Sanh;arXiv,2019
3. Language models are unsupervised multitask learners;Radford;OpenAI blog,2019
4. Isotropy in the contextual embedding space: Clusters and manifolds;Cai;International Conference on Learning Representations,2021
5. Visualizing and measuring the geometry of BERT;Reif;Advances in Neural Information Processing Systems,2019