Authors:
Srikar Appalaraju, Peng Tang, Qi Dong, Nishant Sankaran, Yichu Zhou, R. Manmatha
Abstract
We propose DocFormerv2, a multi-modal transformer for Visual Document Understanding (VDU). The VDU domain entails understanding documents (beyond mere OCR predictions), e.g., extracting information from a form, VQA for documents, and other tasks. VDU is challenging as it needs a model to make sense of multiple modalities (visual, language, and spatial) to make a prediction. Our approach, termed DocFormerv2, is an encoder-decoder transformer which takes vision, language, and spatial features as input. DocFormerv2 is pre-trained with unsupervised tasks employed asymmetrically, i.e., two novel document tasks on the encoder and one on the auto-regressive decoder. The unsupervised tasks have been carefully designed to ensure that the pre-training encourages local-feature alignment between multiple modalities. When evaluated on nine challenging datasets, DocFormerv2 shows state-of-the-art performance over strong baselines on all of them: TabFact (+4.3%), InfoVQA (+1.4%), FUNSD (+1.0%). Furthermore, to show generalization capabilities, on three VQA tasks involving scene-text, DocFormerv2 outperforms previous comparably-sized models and even does better than much larger models (such as GIT2, PaLI, and Flamingo) on these tasks. Extensive ablations show that, due to its novel pre-training tasks, DocFormerv2 understands multiple modalities better than prior art in VDU.
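To make the architecture described above concrete, the following is a minimal sketch of an encoder-decoder transformer that fuses language tokens, spatial (bounding-box) features, and visual features. It is not the authors' implementation: the class name, layer sizes, the 2048-dimensional visual feature assumption, and the simple sum/concatenation fusion are all illustrative assumptions.

```python
import torch
import torch.nn as nn


class MultiModalEncoderDecoder(nn.Module):
    """Illustrative VDU-style encoder-decoder (not the DocFormerv2 release).

    Encoder inputs: OCR tokens (language) + their bounding boxes (spatial)
    + pooled image features (vision). Decoder generates text auto-regressively.
    """

    def __init__(self, vocab_size=32128, d_model=768, n_heads=12, n_layers=6):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)   # language embedding
        self.bbox_emb = nn.Linear(4, d_model)                 # spatial: (x0, y0, x1, y1)
        self.visual_proj = nn.Linear(2048, d_model)           # vision: assumed 2048-d features
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=n_heads,
            num_encoder_layers=n_layers, num_decoder_layers=n_layers,
            batch_first=True,
        )
        self.lm_head = nn.Linear(d_model, vocab_size)         # next-token prediction

    def forward(self, ocr_tokens, ocr_boxes, visual_feats, decoder_tokens):
        # Fuse language + spatial per OCR token, then append visual tokens.
        text_spatial = self.token_emb(ocr_tokens) + self.bbox_emb(ocr_boxes)
        visual = self.visual_proj(visual_feats)
        encoder_inputs = torch.cat([text_spatial, visual], dim=1)

        # Causal mask so the decoder only attends to past positions.
        tgt_mask = self.transformer.generate_square_subsequent_mask(decoder_tokens.size(1))
        decoder_inputs = self.token_emb(decoder_tokens)
        hidden = self.transformer(encoder_inputs, decoder_inputs, tgt_mask=tgt_mask)
        return self.lm_head(hidden)                           # (batch, tgt_len, vocab)


# Toy usage: 2 documents, 50 OCR tokens, 10 visual tokens, 8 decoder steps.
model = MultiModalEncoderDecoder()
logits = model(
    ocr_tokens=torch.randint(0, 32128, (2, 50)),
    ocr_boxes=torch.rand(2, 50, 4),
    visual_feats=torch.rand(2, 10, 2048),
    decoder_tokens=torch.randint(0, 32128, (2, 8)),
)
print(logits.shape)  # torch.Size([2, 8, 32128])
```

The sketch only illustrates the asymmetric layout the abstract mentions: multi-modal fusion happens on the encoder side, while the decoder consumes plain text tokens and predicts the next token auto-regressively.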
Publisher
Association for the Advancement of Artificial Intelligence (AAAI)
Cited by
4 articles.