1. V. Likhosherstov, A. Arnab, K. Choromanski, M. Lucic, Y. Tay, A. Weller, and M. Dehghani, "PolyViT: Co-training vision transformers on images, videos and audio," arXiv preprint arXiv:2111.12993, 2021.
2. Y. Gong, Y.-A. Chung, and J. Glass, "AST: Audio spectrogram transformer," arXiv preprint arXiv:2104.01778, 2021.
3. "Deepfake generation and detection, a survey."
4. E. Altuncu, V. N. Franqueira, and S. Li, "Deepfake: Definitions, Performance Metrics and Standards, Datasets and Benchmarks, and a Meta-Review," arXiv preprint arXiv:2208.10913, 2022.
5. "Review of audio deepfake detection techniques: Issues and prospects."