Affiliation:
1. University of Maryland, College Park, MD
Abstract
We present an adaptive Hindi OCR system implemented as part of a rapidly retargetable language tool effort. The system comprises script identification, character segmentation, training sample creation, and character recognition. In script identification, Hindi words are identified in bilingual or multilingual documents, either from features of the Devanagari script or using Support Vector Machines. The identified words are then segmented into individual characters; composite characters are detected and further segmented using the structural properties of the script and statistical information. Segmented characters are recognized using generalized Hausdorff image comparison (GHIC), and postprocessing is applied to improve performance. The OCR system, which was designed and implemented in one month, was applied to a complete Hindi-English bilingual dictionary and to a set of ideal images extracted from Hindi documents in PDF format. Experimental results show that recognition accuracy can reach 88% for noisy images and 95% for ideal images. The presented method can also be extended to design OCR systems for other scripts.
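The recognition step can be illustrated with a minimal sketch of Hausdorff-style template matching. The abstract does not give the exact GHIC formulation, so this example assumes one common generalization, the rank-based (partial) Hausdorff distance, in which the maximum over nearest-neighbor distances is replaced by a chosen quantile to tolerate noise. All function names, the `fraction` parameter, and the toy templates are hypothetical and are not taken from the paper.

```python
import math


def foreground_points(image):
    """Return (row, col) coordinates of nonzero pixels in a 2D list."""
    return [(r, c) for r, row in enumerate(image) for c, v in enumerate(row) if v]


def directed_rank_hausdorff(a, b, fraction=1.0):
    """Rank-based directed Hausdorff distance from point set a to point set b.

    Each point in a is mapped to the distance of its nearest point in b.
    The result is the value at the given rank fraction of the sorted
    distances; fraction=1.0 gives the classical (maximum) directed distance.
    """
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    nearest = sorted(
        min(math.hypot(p[0] - q[0], p[1] - q[1]) for q in b) for p in a
    )
    k = max(1, math.ceil(fraction * len(nearest)))
    return nearest[k - 1]


def classify(sample, templates, fraction=0.8):
    """Return the label of the template closest to the sample image.

    The score is the larger of the two directed distances (assumed here;
    the paper's actual matching rule may differ).
    """
    pts = foreground_points(sample)

    def score(label):
        tpl = foreground_points(templates[label])
        return max(
            directed_rank_hausdorff(pts, tpl, fraction),
            directed_rank_hausdorff(tpl, pts, fraction),
        )

    return min(templates, key=score)


# Toy 5x5 templates: a vertical bar and a horizontal bar (illustrative only).
TEMPLATES = {
    "bar_v": [[1 if c == 2 else 0 for c in range(5)] for r in range(5)],
    "bar_h": [[1 if r == 2 else 0 for c in range(5)] for r in range(5)],
}

# A vertical bar with one stray noise pixel in the top-left corner.
NOISY_SAMPLE = [row[:] for row in TEMPLATES["bar_v"]]
NOISY_SAMPLE[0][0] = 1

if __name__ == "__main__":
    print(classify(NOISY_SAMPLE, TEMPLATES))
```

With `fraction` below 1.0, the single stray pixel is ignored when ranking distances, which is the property that makes rank-based variants attractive for noisy scans. This brute-force version is quadratic in the number of foreground pixels; a practical system would more likely precompute a distance transform of each template.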
Publisher
Association for Computing Machinery (ACM)