Augmentation-based Pseudo-Ground truth Generation for Deep Learning in Historical Document Segmentation for Greater Levels of Archival Description and Access

Author:

Pack Chulwoo (1), Liu Yi (1), Soh Leen-Kiat (1), Lorang Elizabeth (2)

Affiliation:

1. Computer Science & Engineering, University of Nebraska-Lincoln, Lincoln, Nebraska, United States

2. University Libraries, University of Nebraska-Lincoln, Lincoln, Nebraska, United States

Abstract

The successful use of deep learning solutions for document image segmentation typically relies on a large number of manually labeled ground truth examples, which are expensive to obtain for historical document images that have significant noise effects and variation. At the same time, successful applications of deep learning solutions for document image segmentation have rich potential to facilitate greater levels of description in archival collections (e.g., at and below the item level). These greater levels of description are critical to increasing access to and use of archival collections across an array of research domains. In response, this article investigates whether an augmentation-based approach to generating pseudo-ground truth can be effective with a limited number of labeled images in a document segmentation application. The rationale is that if we can decrease the cost of generating ground truth through augmentation-based approaches, we can use these approaches as part of the description and access pipelines for historical library and archival collections. In this initial exploration, we first generate synthetic images and corresponding pseudo-ground truth from a small number of labeled actual images, using a set of existing degradation-based augmentation models. When generating synthetic images, we control the visual quality distortion based on OCR word-level confidence to avoid generating images unlikely to be present in the dataset. Then, we perform several investigations to examine the impact of incorporating pseudo-ground truth data in the training of the deep learning network dhSegment and further evaluate the use of multiple combinations of degradation models. We also assess the generalizability of the approach by applying the trained network to a larger dataset. Our investigations primarily use real-world datasets known to have significant noise effects.
Results show that augmentation-based pseudo-ground truth generation is capable of improving segmentation performance with the use of the full original dataset and requires only 30% of the original dataset. Results also show that using more than three degradation models is likely to cause overfitting during training. Furthermore, we show that a segmentation network trained on pseudo-ground truth data has generalization capability.
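
The generation step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The two degradation functions, the `contrast_score` stand-in for mean OCR word-level confidence, and the `min_ratio` threshold are all assumptions introduced here for illustration. The sketch relies on one observation: pixel-level degradation does not move region boundaries, so the labeled mask of the original image can be reused as pseudo-ground truth for each accepted synthetic image.

```python
import random


def salt_and_pepper(image, rate, rng):
    """Replace a fraction `rate` of pixels with black (0) or white (255)."""
    return [[rng.choice((0, 255)) if rng.random() < rate else px for px in row]
            for row in image]


def fade(image, amount, rng):
    """Push every pixel toward white by `amount` (0..1), like faded ink.

    `rng` is unused; it is accepted so all models share one signature.
    """
    return [[round(px + (255 - px) * amount) for px in row] for row in image]


def contrast_score(image):
    """Hypothetical stand-in for mean OCR word-level confidence, in [0, 1].

    A real pipeline would query an OCR engine here; this proxy only
    measures the spread between the darkest and lightest pixel.
    """
    pixels = [px for row in image for px in row]
    return (max(pixels) - min(pixels)) / 255


def generate_pseudo_ground_truth(image, mask, models, score, min_ratio, rng):
    """Return (synthetic_image, mask_copy) pairs that pass the quality gate.

    A synthetic image is kept only if its score is at least `min_ratio`
    times the score of the original image (an assumed gating rule).
    """
    base = score(image)
    samples = []
    for model, strength in models:
        synthetic = model(image, strength, rng)
        if score(synthetic) >= min_ratio * base:
            # Degradation changes pixel values only, so the mask is reused.
            samples.append((synthetic, [row[:] for row in mask]))
    return samples


if __name__ == "__main__":
    rng = random.Random(0)
    image = [[0, 255, 0, 255], [255, 0, 255, 0]]   # tiny grayscale page
    mask = [[1, 1, 0, 0], [1, 1, 0, 0]]            # 1 = labeled region
    models = [(salt_and_pepper, 0.1), (fade, 0.2), (fade, 0.9)]
    kept = generate_pseudo_ground_truth(
        image, mask, models, contrast_score, min_ratio=0.5, rng=rng)
    print(f"kept {len(kept)} of {len(models)} synthetic images")
```

In this toy setup the heavy `fade` at 0.9 removes almost all contrast and is rejected by the gate, which mirrors the stated goal of avoiding synthetic images that are unlikely to appear in the dataset.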

Funder

Institute of Museum and Library Services

National Endowment for the Humanities

Holland Computing Center of the University of Nebraska

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Graphics and Computer-Aided Design, Computer Science Applications, Information Systems, Conservation

