Augmentation-based Pseudo-Ground truth Generation for Deep Learning in Historical Document Segmentation for Greater Levels of Archival Description and Access
Published:2022-09-16
Issue:3
Volume:15
Page:1-21
ISSN:1556-4673
Container-title:Journal on Computing and Cultural Heritage
language:en
Short-container-title:J. Comput. Cult. Herit.
Author:
Pack Chulwoo (1),
Liu Yi (1),
Soh Leen-Kiat (1),
Lorang Elizabeth (2)
Affiliation:
1. Computer Science & Engineering, University of Nebraska-Lincoln, Lincoln, Nebraska, United States
2. University Libraries, University of Nebraska-Lincoln, Lincoln, Nebraska, United States
Abstract
The successful use of deep learning solutions for document image segmentation typically relies on a large number of manually labeled ground truth examples, which are expensive to obtain for historical document images that have significant noise effects and variation. At the same time, successful applications of deep learning solutions for document image segmentation have rich potential to facilitate greater levels of description in archival collections (e.g., at and below the item level). These greater levels of description are critical to increasing access to and use of archival collections across an array of research domains. In response, this article investigates whether an augmentation-based approach to generating pseudo-ground truth can be effective with a limited number of labeled images in a document segmentation application. The rationale is that if we can decrease the cost of generating ground truth through augmentation-based approaches, we can use these approaches as part of the description and access pipelines for historical library and archival collections. In this initial exploration, we first generate synthetic images and corresponding pseudo-ground truth using a set of existing degradation-based augmentation models from a small number of labeled actual images. When generating synthetic images, we control the visual quality distortion based on OCR word-level confidence to avoid generating images unlikely to be present in the dataset. Then, we perform several investigations to examine the impact of incorporating pseudo-ground truth data in the training of the deep learning network dhSegment and further evaluate the use of multiple combinations of degradation models. We also assess the generalizability of the approach by applying the trained network on a larger dataset. Our investigations primarily use real-world datasets known to have significant noise effects.
Results show that augmentation-based pseudo-ground truth generation is capable of improving segmentation performance with the use of the full original dataset and requires only 30% of the original dataset. Results also show that using more than three degradation models is likely to cause overfitting during training. Furthermore, we show that a segmentation network trained on pseudo-ground truth data has generalization capability.
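Illustrative sketch (not from the article): the generation step described in the abstract can be pictured roughly as follows. Everything here is an assumption made for illustration: the two degradation functions, the `toy_confidence` stand-in (a contrast proxy, not a real OCR engine), the `min_conf` threshold, and the reuse of the original label mask as pseudo-ground truth, which only makes sense for degradations that do not move pixels. The degradation models and OCR confidence control used by the authors may differ.

```python
import random
from itertools import combinations

def salt_and_pepper(img, severity, rng):
    """Hypothetical degradation: flip a fraction `severity` of pixels to black or white."""
    return [[rng.choice((0, 255)) if rng.random() < severity else p for p in row]
            for row in img]

def fade(img, severity, rng):
    """Hypothetical degradation: pull each pixel toward mid-gray (rng is unused)."""
    return [[round(p + (128 - p) * severity) for p in row] for row in img]

DEGRADATIONS = {"salt_and_pepper": salt_and_pepper, "fade": fade}

def toy_confidence(img):
    """Stand-in for OCR word-level confidence: global contrast in [0, 1].
    A real pipeline would query an OCR engine instead."""
    flat = [p for row in img for p in row]
    return (max(flat) - min(flat)) / 255

def generate_pseudo_ground_truth(image, mask, ocr_confidence, min_conf,
                                 severity=0.1, max_models=2, seed=0):
    """Apply every combination of up to `max_models` degradations to one labeled
    image and keep only results whose confidence stays at or above `min_conf`.
    The label mask is reused unchanged because these degradations alter pixel
    intensity only, not layout."""
    rng = random.Random(seed)
    samples = []
    for k in range(1, max_models + 1):
        for names in combinations(sorted(DEGRADATIONS), k):
            degraded = image
            for name in names:
                degraded = DEGRADATIONS[name](degraded, severity, rng)
            if ocr_confidence(degraded) >= min_conf:
                samples.append((names, degraded, mask))
    return samples
```

Under these assumptions, each accepted `(names, degraded, mask)` triple would be added to the training set alongside the original labeled image; raising `min_conf` discards synthetic images that are distorted more heavily than the gate allows.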
Funder
Institute of Museum and Library Services
National Endowment for the Humanities
Holland Computing Center of the University of Nebraska
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Graphics and Computer-Aided Design,Computer Science Applications,Information Systems,Conservation
Cited by
4 articles.