Handwritten stenography recognition and the LION dataset

Author:

Heil Raphaela, Nauwerck Malin

Abstract

In this paper, we establish the first baseline for handwritten stenography recognition, using the novel LION dataset, and investigate the impact of incorporating selected aspects of stenographic theory into the recognition process. We make the LION dataset publicly available with the aim of encouraging future research in handwritten stenography recognition. A state-of-the-art text recognition model is trained to establish a baseline. Stenographic domain knowledge is integrated by transforming the target sequences into representations which approximate diplomatic transcriptions, wherein each symbol in the script is represented by its own character in the transliteration, as opposed to corresponding combinations of characters from the Swedish alphabet. Four such encoding schemes are evaluated and results are further improved by integrating a pre-training scheme based on synthetic data. The baseline model achieves an average test character error rate (CER) of 29.81% and a word error rate (WER) of 55.14%. Test error rates are reduced significantly (p < 0.01) by combining stenography-specific target sequence encodings with pre-training and fine-tuning, yielding CERs in the range of 24.5–26% and WERs of 44.8–48.2%. An analysis of selected recognition errors illustrates the challenges that the stenographic writing system poses to text recognition. This work establishes the first baseline for handwritten stenography recognition. Our proposed combination of integrating stenography-specific knowledge, in conjunction with pre-training and fine-tuning on synthetic data, yields considerable improvements. Together with our precursor study on the subject, this is the first work to apply modern handwritten text recognition to stenography. The dataset and our code are publicly available via Zenodo.
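
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how CER and WER are commonly computed: the Levenshtein edit distance between reference and prediction, divided by the reference length in characters or words. It is an illustration, not the authors' evaluation code. The function names and the example strings are made up here, and the abstract does not say how the paper averages over lines, so no averaging is shown.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from the reference
                            curr[j - 1] + 1,      # insert h into the reference
                            prev[j - 1] + cost))  # substitute, or match if equal
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate; assumes a non-empty reference string."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate on whitespace-split tokens; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Made-up example strings, not taken from the LION dataset.
reference = "pippi bor i villa villekulla"
prediction = "pippi bor i vila villekula"
print(f"CER = {cer(reference, prediction):.4f}")
print(f"WER = {wer(reference, prediction):.4f}")
```

In this toy example two dropped characters amount to a small CER but make two of the five words wrong, which shows why WER is typically much higher than CER on the same output.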

Funder

Uppsala University

Publisher

Springer Science and Business Media LLC
