Configurable Customized Information Extraction and Processing Pipeline

Published:2024-08-22
ISSN:0218-0014
Container-title:International Journal of Pattern Recognition and Artificial Intelligence
language:en
Short-container-title:Int. J. Patt. Recogn. Artif. Intell.

Author:

Kim Seok¹, Lai Pierce¹, Khan Dariyan¹, Zhao Kevin¹, Le Brian¹, Luchianov Alex¹, Yu Margaret¹, Wang Patrick¹

Affiliation:

1. Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology (MIT), 32 Vassar Street, Cambridge, MA 02139, USA

Abstract

Extracting information from scanned business documents, while a necessary commercial task, is still done mostly by hand and requires significant human effort. Current solutions for automated document information extraction remain limited with regard to user-required customizability and the extraction of dataset-specific information, leaving the area a very active field of research. In this paper, we propose modifications and improvements to our previously developed custom pipeline for extracting and tabulating key-value pairs from commercial invoice documents. Our design changes and additions adapt the pipeline to a wider variety of document types and use cases, primarily through dataset-specific configuration files that promote customizability, along with new technical modules that address both general and dataset-specific complexities. We compare our pipeline’s performance against current machine learning and commercial solutions on a real-world dataset, and demonstrate that it extracts a wider variety of fields while maintaining accuracy that is competitive with or greater than that of the alternative solutions.
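
The idea of a dataset-specific configuration file driving key-value extraction can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not describe the configuration format, so the `INVOICE_CONFIG` dictionary, the regular expressions, the field names, and the `extract_fields` helper are all hypothetical, chosen only to show how one extraction routine might be reused across datasets by swapping the configuration.

```python
import re

# Hypothetical dataset-specific configuration (an assumption, not the paper's format):
# each field name maps to a regex with exactly one capture group for the value.
INVOICE_CONFIG = {
    "invoice_number": r"Invoice\s*(?:Number|No\.?|#)\s*:?\s*([A-Z0-9-]+)",
    "invoice_date": r"Date\s*:?\s*(\d{4}-\d{2}-\d{2})",
    "total": r"Total\s*:?\s*\$?\s*([\d,]+\.\d{2})",
}


def extract_fields(text, config):
    """Return {field: first matched value}, with None for fields not found."""
    result = {}
    for field, pattern in config.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        result[field] = match.group(1) if match else None
    return result


# Stand-in for OCR output from a scanned invoice (made-up sample text).
sample_text = "Invoice No: INV-2024-001\nDate: 2024-08-22\nTotal: $1,250.00"
fields = extract_fields(sample_text, INVOICE_CONFIG)
```

Under these assumptions, supporting a different document type would mean supplying another configuration dictionary rather than changing `extract_fields`; the pipeline in the paper presumably involves considerably more than pattern matching.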

Publisher

World Scientific Pub Co Pte Ltd

Link

https://www.worldscientific.com/doi/pdf/10.1142/S0218001424590122

References (3 articles)

1. Customized Information Extraction and Processing Pipeline for Commercial Invoices

2. Information Extraction System for Invoices and Receipts