1. Learning deep features for scene recognition using places database;zhou;Advances in neural information processing systems,2014
2. Deep multimodal semantic embeddings for speech and images
3. Vision as an interlingua: Learning multilingual semantic embeddings of untranscribed speech;harwath;IEEE International Conference on Acoustics Speech and Signal Processing,2018
4. Unsupervised learning of spoken language with visual context;harwath;Advances in neural information processing systems,2016
5. Text-Free Image-to-Speech Synthesis Using Learned Segmental Units