Study of Tesseract OCR

Published:2024-03-21 Issue:2 Volume:1 Page:41-50
ISSN:2583-2492
Container-title:GLS KALP: Journal of Multidisciplinary Studies
Short-container-title:GLS KALP

Author:

Joshi Kartik

Abstract

In the current era of the Internet and digitization, a huge amount of information is available in different forms, such as books and newspapers. To preserve their contents, such documents are converted to a digital format by scanning them as images. Detecting text in the scanned images and correctly identifying the characters is a challenging problem. Tesseract is an open-source recognition engine that uses some novel techniques for optical character recognition (OCR). It has been designed to recognize more than 100 languages, including English, Italian, French, German, Spanish, and Dutch. It also works for several Indian languages, such as Bengali, Gujarati, Hindi, Kannada, Malayalam, and Oriya. OCR is the branch of image recognition used in applications to recognize text from scanned documents or images. Today, combined with Artificial Intelligence, this technology is becoming a boon for capturing and comprehending data automatically. This paper presents a detailed study of the working of Tesseract OCR.

Publisher

GLS University

References (11 articles)

1. Bhatt, A. (2014). Information needs, perceptions and quests of law faculty in the digital era. The Electronic Library, 32(5), 659–669. https://doi.org/10.1108/el-11-2012-0152

2. Blesser, B. A., Kuklinski, T. T., & Shillman, R. J. (1976). Empirical tests for feature selection based on a psychological theory of character recognition. Pattern Recognition, 8(2), 77-85.

3. Bokser, M. (1992). Omnidocument technologies. Proceedings of the IEEE, 80(7), 1066-1078.

4. Leptonica image processing and analysis library. http://www.leptonica.com.

5. Macwan, S. J., & Vyas, A. N. (2015, August). Classification of offline Gujarati handwritten characters. In 2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI) (pp. 1535-1541). IEEE.