1. Interactive text2pickup networks for natural language-based human-robot collaboration;Ahn;IEEE Robot. Automat. Lett.,2018
2. Bottom-up and top-down attention for image captioning and visual question answering;Anderson,2018
3. VQA: visual question answering;Antol,2015
4. AMC: attention guided multi-modal correlation learning for image search;Chen,2017
5. SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning;Chen,2017