Data-Driven Recognition and Extraction of PDF Document Elements

Published:2019-09-11 Issue:3 Volume:7 Page:65
ISSN:2227-7080
Container-title:Technologies
language:en
Short-container-title:Technologies

Author:

Hansen Matthias, Pomp André, Erki Kemal, Meisen Tobias

Abstract

In the age of digitalization, the collection and analysis of large amounts of data is becoming increasingly important for enterprises to improve their businesses and processes, such as the introduction of new services or the realization of resource-efficient production. Enterprises concentrate strongly on the integration, analysis and processing of their data. Unfortunately, the majority of data analysis focuses on structured and semi-structured data, although unstructured data such as text documents or images account for the largest share of all available enterprise data. One reason for this is that most of this data is not machine-readable and requires dedicated analysis methods, such as natural language processing for analyzing textual documents or object recognition for recognizing objects in images. Especially in the latter case, the analysis methods depend strongly on the application. However, there are also data formats, such as PDF documents, which are not machine-readable and consist of many different document elements such as tables, figures or text sections. Although the analysis of PDF documents is a major challenge, they are used in all enterprises and contain various information that may contribute to analysis use cases. In order to enable their efficient retrievability and analysis, it is necessary to identify the different types of document elements so that we are able to process them with tailor-made approaches. In this paper, we propose a system that forms the basis for structuring unstructured PDF documents, so that the identified document elements can subsequently be retrieved and analyzed with tailor-made approaches. Due to the high diversity of possible document elements and analysis methods, this paper focuses on the automatic identification and extraction of data visualizations, algorithms, other diagram-like objects and tables from a mixed document body. For that, we present two different approaches. 
The first approach uses methods from the area of deep learning and rule-based image processing, whereas the second approach is purely based on deep learning. To train our neural networks, we manually annotated a large corpus of PDF documents with our own annotation tool, both of which are published together with this paper. The results of our extraction pipeline show that we are able to automatically extract graphical items with a precision of 0.73 and a recall of 0.8. For tables, we reach a precision of 0.78 and a recall of 0.94.

Publisher

MDPI AG

Link

https://www.mdpi.com/2227-7080/7/3/65/pdf

Cited by 5 articles.

1. Building datasets to support information extraction and structure parsing from electronic theses and dissertations;International Journal on Digital Libraries;2024-05-03

2. Table structure recognition using black widow based mutual exclusion and RESNET attention model;Journal of Intelligent & Fuzzy Systems;2024-01-10

3. Abstract and Image Analysis of High-Temperature Materials from Scientific Journals Using Deep Learning and Rule-Based Machine Learning Approaches;Lecture Notes in Electrical Engineering;2021-11-09

4. ScanBank: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations;2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL);2021-09

5. Extraction of dimension requirements from engineering drawings for supporting quality control in production processes;Computers in Industry;2021-08