Printed Ottoman Text Recognition Using Synthetic Data and Data Augmentation

Published: 2022-11-17

Author:

Tasdemir Esma F. Bilgin¹

Affiliation:

1. Istanbul Medeniyet University

Abstract

Ottoman script is an Arabic alphabet-based script. It was the writing system of the Turkish language for several centuries until it was replaced in 1928 with the modern Turkish script, which is based on the Latin alphabet. With ever-increasing digitization campaigns, millions of Ottoman documents are coming to light. However, their contents are not directly accessible, nor are they digitally editable or searchable. OCR and text recognition technologies can be a solution to this problem in the form of automated and semi-automated conversion systems. This study presents a deep learning (DL)-based character recognition system for the printed Ottoman script. We first generate a synthetic text image dataset from a text corpus, and then augment it using image processing methods. We develop a hybrid Convolutional Neural Network-Bidirectional Long Short-Term Memory (CNN-BiLSTM) recognizer and train it with the original and the augmented datasets. Finally, we apply a transfer learning procedure to adapt the system to real image data. The proposed system obtains a character error rate (CER) of 0.16 on a test set containing line images from a historical printed Ottoman book.
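To make the reported metric concrete, the following is a minimal sketch of how a character error rate is commonly computed: the Levenshtein (edit) distance between the recognized text and the reference text, summed over all lines and divided by the total number of reference characters. The abstract does not state the exact normalization or any text preprocessing used in the study, so the function names `edit_distance` and `cer` and the corpus-level averaging are assumptions made for illustration only.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    # prev[j] holds the distance between the first i-1 chars of ref
    # and the first j chars of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete r from the reference side
                curr[j - 1] + 1,         # insert h
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(references, hypotheses) -> float:
    """Corpus-level CER (assumed definition): total edits / total reference chars."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("reference text is empty")
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return total_edits / total_chars
```

Under this definition, a CER of 0.16 corresponds to roughly 16 character edits per 100 reference characters. Python strings are compared code point by code point here, so for Arabic-script text any Unicode normalization or handling of diacritics would have to be decided before calling `cer`.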

Publisher

Research Square Platform LLC

Cited by 1 article.

1. OTTOMAN CHARACTER RECOGNITION ON PRINTED DOCUMENTS USING DEEP LEARNING; Mühendislik Bilimleri ve Tasarım Dergisi; 2024-06-30