Affiliation:
1. Istanbul Medeniyet University
Abstract
Ottoman script is an Arabic alphabet-based script. It served as the writing system of the Turkish language for several centuries until it was replaced in 1928 by the modern Turkish script, which is based on the Latin alphabet. With ever-increasing digitization campaigns, millions of Ottoman documents are coming to light. However, their contents are not directly accessible, nor are they digitally editable or searchable. OCR and text recognition technologies can be a solution to this problem in the form of automated and semi-automated conversion systems. This study presents a deep learning (DL)-based character recognition system for the printed Ottoman script. We first generate a synthetic text image dataset from a text corpus and then augment it using image processing methods. We develop a hybrid Convolutional Neural Network-Bidirectional Long Short-Term Memory (CNN-BiLSTM) recognizer and train it with the original and the augmented datasets. Finally, we apply a Transfer Learning procedure to adapt the system to real image data. The proposed system obtains 0.16 CER (character error rate) on a test set containing line images from a historical printed Ottoman book.
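The abstract reports its result as a CER but does not say how the figure is computed or whether it is a fraction or a percentage. As a minimal sketch, assuming the common definition (total character-level edit distance divided by the total number of reference characters), the metric could be computed as below. The function names `edit_distance` and `cer` are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in Unicode code points."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance between ref[:i] and the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def cer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Assumed CER definition: summed edit distance / summed reference length."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return total_edits / total_chars


if __name__ == "__main__":
    # One dropped letter in a four-letter Arabic-script word.
    print(cer(["كتاب"], ["كتب"]))
```

Note that this sketch counts Unicode code points, so combining diacritics count as separate characters; a system evaluated on Ottoman text may normalize or count characters differently.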
Publisher
Research Square Platform LLC
Cited by
1 article.