1. Aytar, Y., Vondrick, C., Torralba, A.: SoundNet: learning sound representations from unlabeled video. In: NIPS (2016)
2. Harwath, D., Torralba, A., Glass, J.R.: Unsupervised learning of spoken language with visual context. In: NIPS (2016)
3. Lecture Notes in Computer Science;A Owens,2016
4. Arandjelović, R., Zisserman, A.: Look, listen and learn. In: Proceedings of ICCV (2017)
5. Barnard, K., Duygulu, P., de Freitas, N., Forsyth, D., Blei, D., Jordan, M.: Matching words and pictures. JMLR 3, 1107–1135 (2003)