DATASET AND GROUND TRUTH FOR HANDWRITTEN TEXT IN FOUR DIFFERENT SCRIPTS

Published:2012-06 Issue:04 Volume:26 Page:1253001
ISSN:0218-0014
Container-title:International Journal of Pattern Recognition and Artificial Intelligence
language:en
Short-container-title:Int. J. Patt. Recogn. Artif. Intell.

Author:

ALAEI ALIREZA¹,PAL UMAPADA²,NAGABHUSHAN P.¹

Affiliation:

1. Department of Studies in Computer Science, University of Mysore, Mysore 570 006, India

2. Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata-108, India

Abstract

In document image analysis (DIA), especially in handwritten document recognition, standard databases play a significant role in evaluating the performance of algorithms and in comparing results obtained by different groups of researchers. The field of DIA with regard to Indo-Persian documents is still in its infancy compared to Latin script-based documents; as such, standard datasets are not yet available in the literature. This paper is an effort towards filling this gap. In this paper, an unconstrained handwritten dataset containing documents in Persian, Bangla, Oriya and Kannada (PBOK) is introduced. The PBOK dataset contains 707 text-pages written in four different languages (Persian, Bangla, Oriya and Kannada) by 436 individuals. The total numbers of text-lines, words/subwords and characters are 12,565, 104,541 and 423,980, respectively. Most documents of the PBOK dataset contain either overlapping or touching text-lines. The average number of text-lines per text-page in the PBOK dataset is 18. Two types of ground truth, based on pixel information and content information, are generated for the dataset. Because of these ground truths, the PBOK dataset can be utilized in many areas of document image processing, e.g. text-line segmentation, word segmentation and word recognition. To provide insight for other researchers, recent text-line segmentation results on this dataset are also reported.
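
As a quick consistency check on the figures quoted in the abstract, the per-unit averages can be derived from the stated totals. The sketch below is illustrative only: the constant and function names are assumptions made here, not part of the dataset or the paper, and the derived ratios (other than the 18 text-lines per page, which the abstract states) are simple quotients of the published counts rather than values reported by the authors.

```python
# Totals as stated in the abstract of the PBOK paper.
PAGES = 707
WRITERS = 436
TEXT_LINES = 12_565
WORDS_OR_SUBWORDS = 104_541
CHARACTERS = 423_980


def pbok_averages():
    """Return simple ratios derived from the published totals.

    Hypothetical helper for illustration; only lines_per_page is
    stated (as a rounded value) in the abstract itself.
    """
    return {
        "lines_per_page": TEXT_LINES / PAGES,
        "words_per_line": WORDS_OR_SUBWORDS / TEXT_LINES,
        "chars_per_word": CHARACTERS / WORDS_OR_SUBWORDS,
        "pages_per_writer": PAGES / WRITERS,
    }


if __name__ == "__main__":
    for name, value in pbok_averages().items():
        print(f"{name}: {value:.2f}")
```

The text-lines-per-page quotient rounds to the 18 quoted in the abstract, so the stated totals and the stated average agree.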

Publisher

World Scientific Pub Co Pte Lt

Subject

Artificial Intelligence,Computer Vision and Pattern Recognition,Software

Link

https://www.worldscientific.com/doi/pdf/10.1142/S0218001412530011

Reference: 23 articles.

1. Piece-wise painting technique for line segmentation of unconstrained handwritten text: a specific study with Persian text documents

2. A new scheme for unconstrained handwritten text-line segmentation

3. Databases for recognition of handwritten Arabic cheques

4. Hidden Markov model-based ensemble methods for offline handwritten text line recognition

Cited by 34 articles.

1. Digitizing History: Transitioning Historical Paper Documents to Digital Content for Information Retrieval and Mining—A Comprehensive Survey;IEEE Transactions on Computational Social Systems;2024

2. An Advanced Modified Freeman Chain Code Algorithm for Enhancing Arabic Character Recognition;Information Systems Engineering and Management;2024

3. MDIW-13: a New Multi-Lingual and Multi-Script Database and Benchmark for Script Identification;Cognitive Computation;2023-08-25

4. A Progressive Approach to Arabic Character Recognition Using a Modified Freeman Chain Code Algorithm;Data and Metadata;2023-05-08

5. Efficient Scalable Template-Matching Technique for Ancient Brahmi Script Image;Computers, Materials & Continua;2023