Authors
Yang Yue, Yu Yue, Li Yingying
Funder
National Natural Science Foundation of China
Publisher
Springer Science and Business Media LLC