A Deep OCR for Degraded Bangla Documents

Author:

Chaudhury Ayan (1), Mukherjee Partha Sarathi (2), Das Sudip (3), Biswas Chandan (3), Bhattacharya Ujjwal (4)

Affiliation:

1. INRIA Grenoble Rhône-Alpes, France and IIT Kharagpur, West Bengal, India

2. Tatras Data, New Delhi, Delhi, India

3. Indian Statistical Institute, Kolkata, West Bengal, India

4. Indian Statistical Institute, Kolkata, India

Abstract

Despite the significant success of document image analysis techniques, efficient Optical Character Recognition (OCR) of degraded document images remains an open problem. Although a body of work has been reported on degraded document recognition for the English language, little attention has been paid to Indic scripts. In this work, we focus on developing an OCR for degraded documents in Bangla, a major Indian language. In general, an OCR system segments the foreground text from the background and then recognizes the extracted text. The text segmentation module aims to assign a foreground or background label to each pixel of the document image. In this paper, we present a new OCR system that is particularly suitable for degraded Bangla document images. The contribution is twofold. In the first phase, we use a semi-supervised Markov Random Field (MRF)-based Generative Adversarial Network (GAN) model (which we call MRF-GAN) for foreground segmentation of text from degraded documents. In the proposed MRF-GAN, we extend the concept of GAN to a multitask learning mechanism where discriminator-classifier networks differentiate between real and fake images and also assign a foreground or background label to each pixel. In the second phase, we propose a new encoder-decoder based recognizer that incorporates an attention-based character-to-word prediction model, which is capable of minimizing the Word Error Rate (WER). We optimize this network using a Multitask-based Transfer Learning (MTTL) scheme. We perform experiments on a publicly available degraded Bangla document image dataset as well as on a new degraded printed Hindi document image dataset, which has been created as part of the present study. The experimental results demonstrate the efficacy of the proposed OCR.
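For readers unfamiliar with the metric named in the abstract, WER is commonly computed as the word-level edit distance between a reference transcription and the recognizer output, divided by the number of reference words. The following is a minimal sketch under that standard definition; the function name word_error_rate and whitespace tokenization are assumptions made for illustration, and the abstract does not state the exact evaluation protocol used by the authors.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (not from the paper): tokens are obtained by
    splitting on whitespace, which is an assumption.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions are counted, the value can exceed 1.0 when the recognizer outputs many more words than the reference contains.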

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science


Cited by 2 articles.

1. Reading Scene Text with Aggregated Temporal Convolutional Encoder;ACM Transactions on Asian and Low-Resource Language Information Processing;2023-11-20

2. Low Resource Degraded Quality Document Image Binarization – Domain Adaptation is the Way;Proceedings of the Thirteenth Indian Conference on Computer Vision, Graphics and Image Processing;2022-12-08
