Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision

Published: 2022-01-04
Volume: 130, Issue: 2, Pages: 435-454
ISSN: 0920-5691
Container-title: International Journal of Computer Vision
Language: en
Short-container-title: Int J Comput Vis

Author:

Shin Andrew, Ishii Masato, Narihira Takuya

Publisher

Springer Science and Business Media LLC

Subject

Artificial Intelligence, Computer Vision and Pattern Recognition, Software

Link

https://link.springer.com/content/pdf/10.1007/s11263-021-01547-8.pdf

References (151 articles)

1. Abu-El-Haija S, Kothari N, Lee J, Natsev P, Toderici G, Varadarajan B, Vijayanarasimhan S (2016) Youtube-8m: A large-scale video classification benchmark. CoRR abs/1609.08675, http://arxiv.org/abs/1609.08675

2. Agrawal P, Carreira J, Malik J (2015) Learning to see by moving. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV)

3. Akbari H, Yuan L, Qian R, Chuang W, Chang S, Cui Y, Gong B (2021) VATT: transformers for multimodal self-supervised learning from raw video, audio and text. CoRR abs/2104.11178, https://arxiv.org/abs/2104.11178

4. Alberti C, Ling J, Collins M, Reitter D (2019) Fusion of detected objects in text for visual question answering. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, pp 2131–2140, https://doi.org/10.18653/v1/D19-1219, https://www.aclweb.org/anthology/D19-1219

5. Anderson P, Fernando B, Johnson M, Gould S (2016) Spice: Semantic propositional image caption evaluation. In: ECCV

Cited by 20 articles.

1. A vision transformer‐based robotic perception for early tea chrysanthemum flower counting in field environments; Journal of Field Robotics; 2024-07-19

2. When Daformer Meets Multi-Modality Datasets; IGARSS 2024 - 2024 IEEE International Geoscience and Remote Sensing Symposium; 2024-07-07

3. A comprehensive survey on applications of transformers for deep learning tasks; Expert Systems with Applications; 2024-05

4. Hugs Bring Double Benefits: Unsupervised Cross-Modal Hashing with Multi-granularity Aligned Transformers; International Journal of Computer Vision; 2024-02-18

5. Deep Semantic-Aware Proxy Hashing for Multi-Label Cross-Modal Retrieval; IEEE Transactions on Circuits and Systems for Video Technology; 2024-01