1. Wang et al. MiniLMv2: Multi-head self-attention relation distillation for compressing pretrained Transformers. arXiv preprint arXiv:2012.15828, 2020.
2. Chen, X., & He, K. Exploring simple Siamese representation learning. CVPR, 2021.
3. Shazeer, N., & Stern, M. Adafactor: Adaptive learning rates with sublinear memory cost. International Conference on Machine Learning, 2018.
4. Conneau et al. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116, 2019.
5. Jelinek, F., Mercer, R. L., Bahl, L. R., & Baker, J. K. Perplexity—a measure of the difficulty of speech recognition tasks. Journal of the Acoustical Society of America, 62(S1), 1977.