1. Attieh, J.: Optimizing the performance of text classification models by improving the isotropy of the embeddings using a joint loss function. Master's thesis, Aalto University, School of Science (2022). http://urn.fi/URN:NBN:fi:aalto-202209255727
2. Biś, D., Podkorytov, M., Liu, X.: Too much in common: Shifting of embeddings in transformer language models and its implications. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 5117–5130. Association for Computational Linguistics, Online (Jun 2021). https://doi.org/10.18653/v1/2021.naacl-main.403, http://aclanthology.org/2021.naacl-main.403
3. Ethayarajh, K.: How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. arXiv preprint arXiv:1909.00512 (2019)
4. Gao, J., He, D., Tan, X., Qin, T., Wang, L., Liu, T.Y.: Representation degeneration problem in training natural language generation models. In: International Conference on Learning Representations (2019). http://openreview.net/forum?id=SkEYojRqtm
5. Gong, C., He, D., Tan, X., Qin, T., Wang, L., Liu, T.Y.: FRAGE: Frequency-agnostic word representation. arXiv preprint arXiv:1809.06858 (2018)