From Digitization and Images to Text and Content: Transkribus as a Case Study

Author:

Prebor Gila

Abstract

Over the last decades, libraries and archives have invested increasingly in digitizing their collections, including manuscripts, rare books, newspapers, and archival material. Many of these resources are freely accessible. However, the accessible material consists only of the resources' metadata and their images. The textual content of the digital images is not yet available as text, so those seeking the content of the resources must study and transcribe individual passages. This article demonstrates the considerable potential of technological tools for the transcription of Hebrew manuscripts. Our analysis shows that handwriting recognition models trained with Transkribus can generate usable results when applied to Hebrew Sephardic semi-cursive manuscripts from the 15th century. This marks a significant advance in the field, as it allows a more efficient and cost-effective approach to transcription. Our findings highlight that remarkable results can be achieved even with a relatively small investment. To recognize text written by a single hand, the recommended amount of ground truth data for training a Transkribus model is approximately 15,000 transcribed words, or 75 pages. In keeping with general machine learning principles, supplying a larger volume of ground truth data improves the accuracy of the transcription results. However, our trials show that good outcomes are still attainable with a smaller amount of data. This is a promising prospect, as it facilitates the mass digitization of previously unpublished manuscripts and opens up vast opportunities for future research.

Publisher

Project MUSE
