A Four-Tier Annotated Urdu Handwritten Text Image Dataset for Multidisciplinary Research on Urdu Script

Published:2016-06-02 Issue:4 Volume:15 Page:1-23
ISSN:2375-4699
Container-title:ACM Transactions on Asian and Low-Resource Language Information Processing
language:en
Short-container-title:ACM Trans. Asian Low-Resour. Lang. Inf. Process.

Author:

Choudhary Prakash¹, Nain Neeta²

Affiliation:

1. National Institute of Technology Manipur, Computer Science and Engineering, Imphal, India

2. National Institute of Technology Jaipur, Computer Science and Engineering, Rajasthan, India

Abstract

This article introduces a large corpus of handwritten text document images for Urdu script, named CALAM (Cursive And Language Adaptive Methodologies). The database contains unconstrained handwritten sentences together with structural annotations of the offline handwritten text images, represented in XML. Urdu is the fourth most frequently used language in the world, but due to its complex cursive writing script and low resources, it remains a priority area for document image analysis. Here, a unified approach is applied in the development of an Urdu corpus by collecting printed texts, handwritten texts, and demographic information of writers on a single form. CALAM contains 1,200 handwritten text images, 3,043 lines, 46,664 words, and 101,181 ligatures. To capture maximum variance among words and handwriting styles, data collection is distributed among six categories and 14 subcategories. Handwritten forms were filled out by 725 different writers belonging to different geographical regions, ages, and genders with diverse educational backgrounds. A structure has been designed to annotate handwritten Urdu script images at line, word, and ligature levels with an XML standard to provide a ground truth of each image at different levels of annotation. This corpus would be useful for linguistic research, serving as a benchmark and a testbed for evaluating handwritten text recognition techniques for Urdu script, signature verification, writer identification, digital forensics, classification of printed and handwritten text, categorization of texts as per use, and so on. Experimental results of some recently developed handwritten text line segmentation techniques, evaluated on the proposed dataset, are also presented in the article to demonstrate its viability and usability.

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

Link

https://dl.acm.org/doi/pdf/10.1145/2857053

Reference: 65 articles.

1. Databases for recognition of handwritten Arabic cheques

2. A Benchmark Kannada Handwritten Document Dataset and Its Segmentation

3. A new scheme for unconstrained handwritten text-line segmentation

4. Dataset and Ground Truth for Handwritten Text in Four Different Scripts

Cited by 14 articles.

1. Efficient CRNN: Towards end-to-end low resource Urdu text recognition using depthwise separable convolutions and gated recurrent units;Information Processing & Management;2024-01

2. Analysis of Cursive Text Recognition Systems: A Systematic Literature Review;ACM Transactions on Asian and Low-Resource Language Information Processing;2023-07-20

3. UTRNet: High-Resolution Urdu Text Recognition in Printed Documents;Lecture Notes in Computer Science;2023

4. UrduAI: Writeprints for Urdu Authorship Identification;ACM Transactions on Asian and Low-Resource Language Information Processing;2022-03-31

5. Word Level Script Identification Using Convolutional Neural Network Enhancement for Scenic Images;ACM Transactions on Asian and Low-Resource Language Information Processing;2022-03-04