Affiliation:
1. National Institute of Technology Manipur, Computer Science and Engineering, Imphal, India
2. National Institute of Technology Jaipur, Computer Science and Engineering, Rajasthan, India
Abstract
This article introduces CALAM (Cursive And Language Adaptive Methodologies), a large corpus of handwritten text document images for Urdu script. The database contains unconstrained handwritten sentences together with structural annotations of the offline handwritten text images, represented in XML. Urdu is the fourth most frequently used language in the world, but because of its complex cursive script and scarce resources, it remains a priority research area for document image analysis. Here, a unified approach is applied to the development of an Urdu corpus by collecting printed texts, handwritten texts, and demographic information about the writers on a single form. CALAM contains 1,200 handwritten text images, 3,043 lines, 46,664 words, and 101,181 ligatures. To capture maximum variance among words and handwriting styles, data collection is distributed across six categories and 14 subcategories. The handwritten forms were filled out by 725 different writers from different geographical regions, ages, and genders, with diverse educational backgrounds. A structure has been designed to annotate handwritten Urdu script images at the line, word, and ligature levels using an XML standard, providing a ground truth for each image at each level of annotation. The corpus is intended to support linguistic research and to serve as a benchmark and testbed for evaluating techniques such as handwritten text recognition for Urdu script, signature verification, writer identification, digital forensics, classification of printed and handwritten text, and categorization of texts by use. Experimental results for several recently developed handwritten text line segmentation techniques, evaluated on the proposed dataset, are also presented to demonstrate its viability and usability.
Publisher
Association for Computing Machinery (ACM)
Cited by
14 articles.