Segmenting large historical notarial manuscripts into multi-page deeds

Published:2024-02-28 Issue:1 Volume:27
ISSN:1433-7541
Container-title:Pattern Analysis and Applications
language:en
Short-container-title:Pattern Anal Applic

Author:

Prieto Jose Ramón, Becerra David, Toselli Alejandro Hector, Alonso Carlos, Vidal Enrique

Abstract

Archives around the world hold vast digitized series of historical manuscript books or “bundles” containing, among others, notarial records also known as “deeds” or “acts”. One of the first steps to provide metadata which describe the contents of those bundles is to segment them into their individual deeds. Even if deeds are often page-aligned, as in the bundles considered in the present work, this is a time-consuming task, often prohibitive given the huge scale of the manuscript series involved. Unlike traditional Layout Analysis methods for page-level segmentation, our approach goes beyond the realm of a single-page image, providing consistent deed detection results on full bundles. This is achieved in two tightly integrated steps: first, we estimate the class-posterior at the page level for the “initial”, “middle”, and “final” classes; then we “decode” these posteriors applying a series of sequentiality consistency constraints to obtain a consistent book segmentation. Experiments are presented for four large historical manuscripts, varying the number of “deeds” used for training. Two metrics are introduced to assess the quality of book segmentation, one of them taking into account the loss of information entailed by segmentation errors. The problem formalization, the metrics and the empirical work significantly extend our previous works on this topic.
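
The two-step idea in the abstract (page-level posteriors followed by constrained decoding) can be illustrated with a small dynamic-programming sketch. This is not the authors' implementation: the abstract does not spell out the constraints, so the transition rules below are assumptions (a bundle starts with an initial page and ends with a final page, every deed spans at least two pages, and a final page can only be followed by an initial page). The names `ALLOWED`, `decode`, and `to_deeds` are hypothetical, and the posteriors in the usage lines are made-up numbers, not data from the paper.

```python
import math

CLASSES = ("I", "M", "F")  # initial, middle, final page of a deed

# Assumed consistency constraints (not taken from the paper):
# which class may follow which on consecutive pages.
ALLOWED = {
    "I": ("M", "F"),
    "M": ("M", "F"),
    "F": ("I",),
}


def decode(posteriors):
    """Return the most probable consistent label sequence (Viterbi-style).

    posteriors: list of dicts, one per page, mapping "I"/"M"/"F" to a
    probability. The result starts with "I" and ends with "F".
    """
    n = len(posteriors)
    if n < 2:
        raise ValueError("need at least two pages under the assumed constraints")

    def logp(t, c):
        p = posteriors[t][c]
        return math.log(p) if p > 0 else -math.inf

    # best[t][c]: best log-score of a consistent labeling of pages 0..t ending in c
    best = [{c: -math.inf for c in CLASSES} for _ in range(n)]
    back = [{c: None for c in CLASSES} for _ in range(n)]
    best[0]["I"] = logp(0, "I")  # the bundle must open a deed

    for t in range(1, n):
        for prev in CLASSES:
            if best[t - 1][prev] == -math.inf:
                continue
            for cur in ALLOWED[prev]:
                score = best[t - 1][prev] + logp(t, cur)
                if score > best[t][cur]:
                    best[t][cur] = score
                    back[t][cur] = prev

    if best[n - 1]["F"] == -math.inf:
        raise ValueError("no consistent segmentation exists")

    # Trace back from the mandatory final label on the last page.
    labels = ["F"]
    for t in range(n - 1, 0, -1):
        labels.append(back[t][labels[-1]])
    labels.reverse()
    return labels


def to_deeds(labels):
    """Convert a consistent label sequence into (first_page, last_page) spans."""
    deeds, start = [], 0
    for i, lab in enumerate(labels):
        if lab == "I":
            start = i
        elif lab == "F":
            deeds.append((start, i))
    return deeds


# Illustrative posteriors: page-wise argmax gives I F M M F, which is inconsistent.
pages = [
    {"I": 0.8, "M": 0.1, "F": 0.1},
    {"I": 0.1, "M": 0.3, "F": 0.6},
    {"I": 0.4, "M": 0.5, "F": 0.1},
    {"I": 0.1, "M": 0.7, "F": 0.2},
    {"I": 0.1, "M": 0.2, "F": 0.7},
]
labels = decode(pages)
deeds = to_deeds(labels)
```

In this toy input, taking the most probable class on each page independently would place a middle page right after a final page. The decoder instead picks the best sequence among those that respect the assumed constraints, which is the kind of bundle-level consistency the abstract describes.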

Funder

Ministerio de Ciencia e Innovación

Universitat Politècnica de València

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1007/s10044-024-01235-6.pdf

References (32 articles)

1. Andrés J, Prieto JR, Granell E et al (2022) Information extraction from handwritten tables in historical documents. In: Document analysis systems (DAS). LNCS, vol 13237. Springer, Cham, pp 184–198

2. Biswas S, Riba P, Lladós J et al (2021) Beyond document object detection: instance-level segmentation of complex layouts. Int J Doc Anal Recognit 24:269–281

3. Boillet M, Kermorvant C, Paquet T (2021) Multiple document datasets pre-training improves text line detection with deep neural networks. In: 2020 25th International conference on pattern recognition (ICPR), IEEE Computer Society, Los Alamitos, pp 2134–2141

4. Boillet M, Kermorvant C, Paquet T (2022) Robust text line detection in historical documents: learning and evaluation methods. Int J Doc Anal Recognit 25:95–114

5. Bosch V, Toselli AH, Vidal E (2012) Statistical text line analysis in handwritten documents. In: 2012 International conference on frontiers in handwriting recognition (ICFHR), IEEE, pp 201–206