VisFormers—Combining Vision and Transformers for Enhanced Complex Document Classification
Published: 2024-02-16
Volume: 6
Issue: 1
Pages: 448-463
ISSN: 2504-4990
Container-title: Machine Learning and Knowledge Extraction
Language: en
Short-container-title: MAKE
Author: Subhayu Dutta ¹, Subhrangshu Adhikary ², Ashutosh Dhar Dwivedi ³
Affiliation:
1. Department of Computer Science & Engineering, Dr. B.C. Roy Engineering College, Durgapur 713206, West Bengal, India
2. Department of Research & Development, Spiraldevs Automation Industries Pvt. Ltd., Raiganj 733123, West Bengal, India
3. Cyber Security Group, Aalborg University, DK-2000 Copenhagen, Denmark
Abstract
Complex documents contain text, figures, tables, and other elements. Classifying scanned copies of different categories of complex documents, such as memos, newspapers, and letters, is essential for rapid digitization. However, this task is very challenging because most scanned complex documents look alike: the pages and letters have similar colors, the papers have similar textures, and there are very few contrasting features. Several attempts have been made in the state of the art to classify complex documents; however, only a few of these works have addressed the classification of complex documents with similar features, and their performance leaves room for improvement. To overcome this, the paper presents a method that uses optical character recognition (OCR) to extract the text and proposes a multi-headed model that combines vision-based transfer learning and natural-language-based Transformers within the same network, allowing simultaneous training on different inputs with separate optimizers for specific parts of the network. A subset of the Ryerson Vision Lab Complex Document Information Processing (RVL-CDIP) dataset containing 16 document classes was used to evaluate performance. The proposed multi-headed VisFormers network classified the documents with up to 94.2% accuracy, while a regular natural-language-processing-based Transformer network achieved 83% and vision-based VGG19 transfer learning achieved only up to 90%. Deploying the model can help sort scanned copies of various documents into different categories.
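The architecture described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' released code: the layer sizes, vocabulary size, fusion scheme, and optimizer choices are assumptions. It shows only the structural idea of a VGG19 image branch and a Transformer text branch over OCR tokens, fused into one 16-class classifier and trained with separate optimizers for specific parts of the network (PyTorch):

    # Hypothetical sketch of a two-headed "VisFormers"-style network.
    # Sizes, vocabulary, and optimizers are illustrative assumptions.
    import torch
    import torch.nn as nn
    from torchvision import models

    class VisFormerSketch(nn.Module):
        def __init__(self, vocab_size=30522, d_model=256, num_classes=16):
            super().__init__()
            # Vision head: pretrained VGG19 features, pooled to a 512-dim vector.
            vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT)
            self.vision = nn.Sequential(vgg.features,
                                        nn.AdaptiveAvgPool2d(1),
                                        nn.Flatten())
            # Text head: embeddings over OCR tokens + a small Transformer encoder.
            self.embed = nn.Embedding(vocab_size, d_model)
            layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            self.text = nn.TransformerEncoder(layer, num_layers=2)
            # Fusion head: concatenate both representations, then classify.
            self.classifier = nn.Linear(512 + d_model, num_classes)

        def forward(self, image, token_ids):
            v = self.vision(image)                             # (B, 512)
            t = self.text(self.embed(token_ids)).mean(dim=1)   # (B, d_model)
            return self.classifier(torch.cat([v, t], dim=1))

    model = VisFormerSketch()
    # Separate optimizers for specific parts of the network, per the abstract.
    opt_vision = torch.optim.SGD(model.vision.parameters(), lr=1e-4)
    opt_text = torch.optim.AdamW(
        list(model.embed.parameters()) + list(model.text.parameters())
        + list(model.classifier.parameters()), lr=1e-4)

In a training step, a single backward pass through the combined classification loss would be followed by a step on each optimizer, so the two branches can follow different update rules while being trained simultaneously.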
Funder
Spiraldevs Automation Industries Pvt. Ltd.