Augmentation-based Pseudo-Ground truth Generation for Deep Learning in Historical Document Segmentation for Greater Levels of Archival Description and Access

Published: 2022-09-16 Volume: 15 Issue: 3 Pages: 1-21
ISSN: 1556-4673
Container-title: Journal on Computing and Cultural Heritage
Language: en
Short-container-title: J. Comput. Cult. Herit.

Author:

Pack Chulwoo¹, Liu Yi¹, Soh Leen-Kiat¹, Lorang Elizabeth²

Affiliation:

1. Computer Science & Engineering, University of Nebraska-Lincoln, Lincoln, Nebraska, United States

2. University Libraries, University of Nebraska-Lincoln, Lincoln, Nebraska, United States

Abstract

The successful use of deep learning solutions for document image segmentation typically relies on a large number of manually labeled ground truth examples, which are expensive to obtain for historical document images that have significant noise effects and variation. At the same time, successful applications of deep learning solutions for document image segmentation have rich potential to facilitate greater levels of description in archival collections (e.g., at and below the item-level). These greater levels of description are critical to increasing access to and use of archival collections across an array of research domains. In response, this article investigates whether an augmentation-based approach to generating pseudo-ground truth can be effective with a limited number of labeled images in a document segmentation application. The rationale is that if we can decrease the cost of generating ground truth through augmentation-based approaches, we can use these approaches as part of the description and access pipelines for historical library and archival collections. In this initial exploration, we first generate synthetic images and corresponding pseudo-ground truth using a set of existing degradation-based augmentation models from a small number of labeled actual images. When generating synthetic images, we control the visual quality distortion based on OCR word-level confidence to avoid generating images unlikely to be present in the dataset. Then, we perform several investigations to examine the impact of incorporating pseudo-ground truth data in the training of the deep learning network dhSegment and further evaluate the use of multiple combinations of degradation models. We also assess the generalizability of the approach by applying the trained network on a larger dataset. Our investigations primarily use real-world datasets known to have significant noise effects.
Results show that augmentation-based pseudo-ground truth generation is capable of improving segmentation performance with the use of the full original dataset and requires only 30% of the original dataset. Results also show that using more than three degradation models is likely to cause overfitting during training. Furthermore, we show that a segmentation network trained on pseudo-ground truth data has generalization capability.
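
The generation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the two degradation functions (`salt_and_pepper`, `fade`), the `toy_confidence` scorer, the `generate_pairs` helper, and the threshold value are all hypothetical stand-ins chosen for illustration. In particular, `toy_confidence` uses pixel contrast where a real pipeline would average an OCR engine's word-level confidences, and reusing the original label mask unchanged is an assumption that holds only for degradations that do not move pixels.

```python
import random
from typing import Callable, List, Sequence, Tuple

Image = List[List[float]]  # grayscale pixels in [0.0, 1.0]
Mask = List[List[int]]     # per-pixel class labels (the ground truth)
Degradation = Callable[[Image, random.Random], Image]


def salt_and_pepper(img: Image, rng: random.Random, amount: float = 0.05) -> Image:
    """Hypothetical degradation: flip a fraction of pixels to pure black or white."""
    out = [row[:] for row in img]  # copy so the labeled original stays untouched
    for row in out:
        for x in range(len(row)):
            if rng.random() < amount:
                row[x] = rng.choice((0.0, 1.0))
    return out


def fade(img: Image, rng: random.Random, strength: float = 0.3) -> Image:
    """Hypothetical degradation: pull every pixel toward mid-gray (faded ink)."""
    return [[p + (0.5 - p) * strength for p in row] for row in img]


def toy_confidence(img: Image) -> float:
    """Stand-in for averaged OCR word-level confidence: plain pixel contrast."""
    pixels = [p for row in img for p in row]
    return max(pixels) - min(pixels)


def generate_pairs(
    image: Image,
    mask: Mask,
    degradations: Sequence[Degradation],
    confidence: Callable[[Image], float],
    min_confidence: float,
    rng: random.Random,
) -> List[Tuple[Image, Mask]]:
    """Build (synthetic image, pseudo-ground truth) pairs from one labeled image.

    A synthetic image is kept only if its confidence score stays at or above
    the threshold, so overly distorted images are discarded. The mask is
    reused as-is (assumption: the degradations are purely photometric).
    """
    pairs = []
    for degrade in degradations:
        synthetic = degrade(image, rng)
        if confidence(synthetic) >= min_confidence:
            pairs.append((synthetic, mask))
    return pairs


if __name__ == "__main__":
    page = [[0.0, 1.0], [1.0, 0.0]]
    labels = [[0, 1], [1, 0]]
    kept = generate_pairs(
        page, labels, [salt_and_pepper, fade], toy_confidence, 0.5, random.Random(0)
    )
    print(len(kept), "synthetic training pairs kept")
```

Passing the confidence scorer in as a callable keeps the gating logic separate from any particular OCR engine, so the contrast stand-in could be swapped for real word-level confidences without changing `generate_pairs`.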

Funder

Institute of Museum and Library Services

National Endowment for the Humanities

Holland Computing Center of the University of Nebraska

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Graphics and Computer-Aided Design, Computer Science Applications, Information Systems, Conservation

Link

https://dl.acm.org/doi/pdf/10.1145/3485845
