Affiliation:
1. Beijing University of Technology, China
2. The University of Sydney, Australia
Abstract
Long Document Classification (LDC) has attracted great attention in NLP and achieved considerable progress owing to large-scale pre-trained language models. Despite this, LDC differs from traditional text classification and is far from settled. Long documents, such as news stories and articles, generally contain thousands of words organized in complex structures. Moreover, compared with flat text, long documents usually carry multi-modal content such as images, which provide rich information but have not yet been utilized for classification. In this paper, we propose a novel cross-modal method for long document classification, in which multiple-granularity feature shifting networks adaptively integrate the multi-scale textual and visual features of long documents. Additionally, a multi-modal collaborative pooling block is proposed to eliminate redundant fine-grained text features while simultaneously reducing computational complexity. To verify the effectiveness of the proposed model, we conduct experiments on the Food101 dataset and two newly constructed multi-modal long document datasets. The experimental results show that the proposed cross-modal method outperforms single-modal text methods as well as state-of-the-art multi-modal baselines.
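The adaptive integration of textual and visual features described above can be illustrated with a simple gated fusion, in the spirit of Gated Multimodal Units. This is a minimal sketch with hypothetical per-dimension weights, not the paper's actual feature shifting network or collaborative pooling block.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(text_feat, vis_feat, w_text, w_vis):
    """Fuse two equal-length feature vectors with a per-dimension gate.

    z_i = sigmoid(w_text[i] * t_i + w_vis[i] * v_i) decides, for each
    dimension, how much of the textual feature (vs. the visual one) to keep.
    The weights here stand in for parameters that a real model would learn.
    """
    fused = []
    for t, v, wt, wv in zip(text_feat, vis_feat, w_text, w_vis):
        z = sigmoid(wt * t + wv * v)
        fused.append(z * t + (1.0 - z) * v)
    return fused


# With text-favoring weights in dimension 0 and vision-favoring weights in
# dimension 1, the gate routes each dimension to the dominant modality.
fused = gated_fusion([1.0, 0.0], [0.0, 1.0], [10.0, 10.0], [-10.0, -10.0])
```

In the actual model, such gate weights would be learned jointly with the multi-scale text and image encoders; the fixed weights above merely demonstrate the per-dimension routing behavior.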
Publisher
Association for Computing Machinery (ACM)