Unstructured Document Information Extraction Method with Multi-Faceted Domain Knowledge Graph Assistance for M2M Customs Risk Prevention and Screening Application

Published:2024-05-15 Issue:10 Volume:13 Page:1941
ISSN:2079-9292
Container-title:Electronics
language:en
Short-container-title:Electronics

Author:

Tian Fengchun¹, Wang Haochen¹, Wan Zhenlong², Liu Ran¹, Liu Ruilong¹, Lv Di¹, Lin Yingcheng¹

Affiliation:

1. The School of Microelectronics and Communication Engineering, Chongqing University, Chongqing 400044, China

2. National Information Center of GACC, Beijing 100010, China

Abstract

Customs is a crucial line of national security defense, yet its existing risk prevention and screening system falls short in the intelligence and diversity of its risk identification factors. The urgent issues for the risk identification system are therefore the intelligent extraction of key information from Customs Unstructured Accompanying Documents (CUADs) and the reliability of the extraction results. In the customs scenario, OCR is employed for M2M interactions, but current models have difficulty adapting to diverse image qualities and complex customs document content. We propose a hybrid mutual learning knowledge distillation (HMLKD) method to optimize a pre-trained OCR model's performance against these challenges. Current models also lack effective incorporation of domain-specific knowledge, which leaves text recognition accuracy insufficient for practical customs risk identification. We therefore build a customs domain knowledge graph (CDKG) from CUAD knowledge and propose an integrated post-OCR correction method based on it (iCDKG-PostOCR). Results on real data show that accuracy improves to 97.70% for code text fields, 96.55% for character type fields, and 96.00% for numerical type fields, with a confidence rate exceeding 99% for each. Furthermore, the Customs Health Certificate Extraction System (CHCES) developed using the proposed method has been implemented and verified at Tianjin Customs in China, where it has shown strong operational performance.
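
The abstract does not describe how iCDKG-PostOCR works internally, so the following is only a minimal sketch of the general idea of post-OCR correction against domain knowledge, not the authors' method. It assumes the knowledge graph can be reduced to a per-field list of valid values and uses plain string similarity from Python's standard library. The `KNOWN_VALUES` table, the field names, and the `correct_field` helper are hypothetical and invented for illustration.

```python
from difflib import get_close_matches

# Hypothetical stand-in for a domain knowledge graph: each field type maps
# to the values known to be valid for it. These entries are invented for
# illustration and are not taken from the paper.
KNOWN_VALUES = {
    "country": ["AUSTRALIA", "NEW ZEALAND", "BRAZIL"],
    "product": ["FROZEN BEEF", "FROZEN LAMB", "MILK POWDER"],
}


def correct_field(field_type, ocr_text, cutoff=0.8):
    """Return (text, changed) for one OCR-recognized field.

    The OCR output is snapped to the closest known value when the
    similarity ratio reaches `cutoff`; otherwise it is returned unchanged.
    """
    candidates = KNOWN_VALUES.get(field_type, [])
    normalized = ocr_text.strip().upper()
    matches = get_close_matches(normalized, candidates, n=1, cutoff=cutoff)
    if matches:
        return matches[0], matches[0] != normalized
    return ocr_text, False


if __name__ == "__main__":
    # A digit "1" misread in place of the letter "I".
    print(correct_field("country", "AUSTRAL1A"))
    # No sufficiently similar known value: the text is left untouched.
    print(correct_field("country", "XYZ"))
```

A real system would presumably also use relations between fields (for example, which products are plausible for which exporting country), which a flat lookup like this cannot express.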

Funder

National Key Research and Development Program of China

Publisher

MDPI AG

Link

https://www.mdpi.com/2079-9292/13/10/1941/pdf
