Funder
National Natural Science Foundation of China
Cited by
1 article.
1. Rethinking Missing Modality Learning from a Decoding Perspective;Proceedings of the 31st ACM International Conference on Multimedia;2023-10-26