A Deep OCR for Degraded Bangla Documents

Published: 2022-08-25 Issue: 5 Volume: 21 Page: 1-20
ISSN: 2375-4699
Container-title: ACM Transactions on Asian and Low-Resource Language Information Processing
Language: en
Short-container-title: ACM Trans. Asian Low-Resour. Lang. Inf. Process.

Author:

Chaudhury Ayan¹, Mukherjee Partha Sarathi², Das Sudip³, Biswas Chandan³, Bhattacharya Ujjwal⁴

Affiliation:

1. INRIA Grenoble Rhône-Alpes, France and IIT Kharagpur, West Bengal, India

2. Tatras Data, New Delhi, Delhi, India

3. Indian Statistical Institute, Kolkata, West Bengal, India

4. Indian Statistical Institute, Kolkata, India

Abstract

Despite the significant success of document image analysis techniques, efficient Optical Character Recognition (OCR) of degraded document images remains an open problem. Although a body of work has been reported on degraded document recognition for the English language, little attention has been paid to Indic scripts. In this work, we focus on developing an OCR for degraded documents in Bangla, a major Indian language. In general, an OCR system segments the foreground text from the background and then recognizes the extracted text. The text segmentation module assigns a foreground or background label to each pixel of the document image. In this paper, we present a new OCR system that is particularly suitable for degraded Bangla document images. The contribution is twofold. In the first phase, we use a semi-supervised Markov Random Field (MRF)-based Generative Adversarial Network (GAN) model, which we call MRF-GAN, for foreground segmentation of text in degraded documents. In the proposed MRF-GAN, we extend the concept of GAN to a multitask learning mechanism in which discriminator-classifier networks differentiate between real and fake images and also assign a foreground or background label to each pixel. In the second phase, we propose a new encoder-decoder-based recognizer that incorporates an attention-based character-to-word prediction model capable of minimizing the Word Error Rate (WER). We optimize this network using a Multitask-based Transfer Learning (MTTL) scheme. We perform experiments on a publicly available degraded Bangla document image dataset as well as on a new degraded printed Hindi document image dataset created as part of the present study. The experimental results demonstrate the efficacy of the proposed OCR.
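
The abstract names Word Error Rate (WER) as the quantity the recognizer is designed to minimize. As a minimal sketch, assuming the commonly used definition (word-level edit distance divided by the number of reference words), WER can be computed as below. The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions; this is not the authors' evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace. Real evaluations may
    normalize punctuation or Unicode forms before comparing.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far (excluding the current word) and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
```

Under this definition WER is 0 for an exact match and can exceed 1 when the hypothesis contains many extra words, since insertions count as errors but do not lengthen the reference.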

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

Link

https://dl.acm.org/doi/pdf/10.1145/3511807

References: 74 articles.

Cited by 2 articles.

1. Reading Scene Text with Aggregated Temporal Convolutional Encoder. ACM Transactions on Asian and Low-Resource Language Information Processing. 2023-11-20.

2. Low Resource Degraded Quality Document Image Binarization – Domain Adaptation is the Way. Proceedings of the Thirteenth Indian Conference on Computer Vision, Graphics and Image Processing. 2022-12-08.