1. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv abs/1810.04805, 2019.
2. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach. ArXiv abs/1907.11692, 2019.
3. Goyal, S., et al.: PoWER-BERT: Accelerating BERT inference via progressive word-vector elimination. In: International Conference on Machine Learning, 2020.
4. Ye, D., Lin, Y., Huang, Y., Sun, M.: TR-BERT: Dynamic token reduction for accelerating BERT inference. In: North American Chapter of the Association for Computational Linguistics, 2021.
5. Huang, Z., Hou, L., Shang, L., Jiang, X., Chen, X., Liu, Q.: GhostBERT: Generate more features with cheap operations for BERT. In: Annual Meeting of the Association for Computational Linguistics, 2021.