Generic HTR Models for Medieval Manuscripts. The CREMMALab Project

Author:

Ariane Pinche (1, 2)

Affiliation:

1. Histoire, Archéologie et Littératures des mondes chrétiens et musulmans médiévaux

2. Centre National de la Recherche Scientifique

Abstract

In the Humanities, the emergence of digital methods has opened up research questions to quantitative analysis. This is why HTR technology is increasingly involved in humanities research projects, following precursors such as the Himanis project. However, many research teams have limited resources, either financially or in terms of their expertise in artificial intelligence. It may therefore be difficult for them to integrate handwritten text recognition into their project pipeline if they need to train a model or create data from scratch. The goal here is not to explain how to build or improve a new HTR engine, nor to find a way to automatically align a preexisting corpus with an image to quickly create ground truths for training. This paper aims to help humanists easily develop an HTR model for medieval manuscripts, and to create and gather training data with an understanding of the issues underlying their choices. The objective is also to show the importance of producing consistent data as a prerequisite both for pooling datasets and for training efficient HTR models. We will present an overview of our work and experiments in the CREMMALab project (2021-2022), showing first how we ensure the consistency of the data and then how we have developed a generic model for medieval French manuscripts from the 13th to the 15th century, ready to be shared (more than 94% accuracy) and/or fine-tuned by other projects.

Publisher

Centre pour la Communication Scientifique Directe (CCSD)

Subject

General Earth and Planetary Sciences, General Engineering, General Environmental Science

