Exploiting Script Similarities to Compensate for the Large Amount of Data in Training Tesseract LSTM: Towards Kurdish OCR

Published:2021-10-19
Issue:20
Volume:11
Page:9752
ISSN:2076-3417
Container-title:Applied Sciences
language:en
Short-container-title:Applied Sciences

Author:

Idrees Saman, Hassani Hossein

Abstract

Applications based on Long Short-Term Memory (LSTM) require large amounts of training data. Tesseract LSTM is a popular Optical Character Recognition (OCR) engine that has been trained and used for various languages. However, training it is hindered when the target language is less-resourced. This research suggests a remedy for the problem of scant data in training Tesseract LSTM for a new language by exploiting a training dataset for a language with a similar script. The target of the experiment is Kurdish, a multi-dialect language that is considered less-resourced. We choose Sorani, one of the Kurdish dialects, which is mostly written in Persian-Arabic script. We train Tesseract using an Arabic dataset, and then we use a considerably small amount of text in Persian-Arabic script to train the engine to recognize Sorani texts. Our dataset is based on a series of court case documents from the Kurdistan Region of Iraq. We also fine-tune the engine using 10 Unikurd fonts. We use Lstmeval and Ocreval to evaluate the outputs. The results indicate an accuracy of 95.45%. We also test the engine using texts outside the context of court cases. The accuracy of the system remains close to the earlier result, indicating that script similarity can be used to overcome the lack of large-scale data.

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science

Link

https://www.mdpi.com/2076-3417/11/20/9752/pdf

References: 39 articles.

1. BLARK for multi-dialect languages: towards the Kurdish BLARK

2. Effective long short-term memory with fruit fly optimization algorithm for time series forecasting

3. Effective energy consumption forecasting using empirical wavelet transform and long short-term memory

4. Commercial Vacancy Prediction Using LSTM Neural Networks

Cited by 4 articles.

1. A scarce dataset for ancient Arabic handwritten text recognition;Data in Brief;2024-10

2. Sentiment Analysis of Opinions about Online Education in the Kurdistan Region of Iraq during COVID-19;Qeios;2023-09-12

3. A Survey of OCR in Arabic Language: Applications, Techniques, and Challenges;Applied Sciences;2023-04-04

4. Dhivehi OCR Engine;2022 IEEE International Conference on Distributed Computing and Electrical Circuits and Electronics (ICDCECE);2022-04-23