1. Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. SPICE: Semantic Propositional Image Caption Evaluation. In 2016 Proceedings of the European Conference on Computer Vision. 382–398.
2. Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In 2018 Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6077–6086.
3. Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In 2005 Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. 65–72.
4. Top-Down Framework for Weakly-Supervised Grounded Image Captioning
5. Chen Cai, Kim-Hui Yap, and Suchen Wang. 2022. Attribute Conditioned Fashion Image Captioning. In 2022 IEEE International Conference on Image Processing. 1921–1925. https://doi.org/10.1109/ICIP46576.2022.9897417