Abstract
In the current era of the Internet and digitization, a huge amount of information is available in different forms such as books and newspapers. To preserve their contents, such documents are converted to a digital format by scanning them as images. Detecting text in the scanned images and correctly identifying the characters is a challenging problem. Optical character recognition (OCR) is the branch of image recognition used in applications to recognize text in scanned documents or images. Tesseract is an open-source OCR engine that uses some novel techniques for character recognition. It has been designed to recognize more than 100 languages, including English, Italian, French, German, Spanish and Dutch, as well as Indian languages such as Bengali, Gujarati, Hindi, Kannada, Malayalam and Oriya. Today, combined with the field of Artificial Intelligence, OCR is becoming a boon for capturing and comprehending data automatically. This paper presents a detailed study of the working of the Tesseract OCR engine.