Multi-Layout Invoice Document Dataset (MIDD): A Dataset for Named Entity Recognition

Published: 2021-07-20 Issue: 7 Volume: 6 Page: 78
ISSN: 2306-5729
Container-title: Data
Language: en
Short-container-title: Data

Authors:

Baviskar Dipali, Ahirrao Swati, Kotecha Ketan

Abstract

The day-to-day operations of an organization produce a massive volume of unstructured data in the form of invoices, legal contracts, mortgage processing forms, and many other documents. Organizations can use the insights concealed in such documents for their operational benefit. However, analyzing and extracting insights from so many complex unstructured documents is tedious. Research in this area therefore encourages the development of novel frameworks and tools that automate key information extraction from unstructured documents. The scarcity of standard, high-quality, annotated unstructured document datasets remains a serious obstacle to that goal. This work eases the researcher's task by providing a high-quality, highly diverse, multi-layout, annotated invoice document dataset for key information extraction. Researchers can use the proposed dataset for layout-independent processing of unstructured invoice documents and to develop artificial intelligence (AI)-based tools that identify and extract named entities in invoices. Our dataset includes 630 invoice document PDFs with four different layouts collected from diverse suppliers. As far as we know, our invoice dataset is the only openly available dataset comprising high-quality, highly diverse, multi-layout, and annotated invoice documents.

Publisher

MDPI AG

Subject

Information Systems and Management, Computer Science Applications, Information Systems

Link

https://www.mdpi.com/2306-5729/6/7/78/pdf

References (22 articles)

1. 30 Eye-Opening Big Data Statistics for 2020: Patterns Are Everywhere https://kommandotech.com/statistics/big-data-statistics/

2. A Bibliometric Survey on Cognitive Document Processing;Philosophy;Libr. Philos. Pract.,2020

3. Efficient Automated Processing of the Unstructured Documents Using Artificial Intelligence: A Systematic Literature Review and Future Directions

4. Limitations of information extraction methods and techniques for heterogeneous unstructured big data

5. An analytical study of information extraction from unstructured and multidimensional big data

Cited by 2 articles.

1. A deep learning-based solution for digitization of invoice images with automatic invoice generation and labelling; International Journal on Document Analysis and Recognition (IJDAR); 2023-08-25

2. Business Document Information Extraction: Towards Practical Benchmarks; Lecture Notes in Computer Science; 2022