Affiliation:
1. Birla Institute of Technology, Mesra, Ranchi, India
Abstract
This article introduces a tri-layered segmentation and bi-level classifier system for classifying printed Hindi documents. It sorts document images into predefined, mutually exclusive categories, using an SVM for character classification and fuzzy matching for document classification. During training, the enhanced, noise-free image is segmented into lines and words by profiling. The system then extracts isolated Shirorekha-Less (SL) characters, along with upper, left, and right modifier components, from the SL words. Each modifier component is associated only with its corresponding character, based on its location and the distance between the character and the modifier component. Next, confidence values for all characters are computed through SVM training, and the characters are mapped to Romanized labels to generate words. Finally, documents are classified by fuzzy matching of the detected Romanized words against the predefined classes. For SL characters, the average execution times are 0.22675 s and 0.20375 s, and the classification accuracies are 74.61% and 80.73%, for training and testing, respectively.
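The line and word segmentation step can be illustrated with a minimal sketch. It assumes that "profiling" refers to projection profiles, that is, counting ink pixels per row to find text lines and per column to find words; the abstract does not spell this out. The function names, the zero threshold, and the toy image are all hypothetical and are not taken from the article.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = ink, 0 = background (assumed encoding)


def segment_runs(profile: List[int], threshold: int = 0) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, where profile > threshold."""
    runs = []
    start = None
    for i, value in enumerate(profile):
        if value > threshold and start is None:
            start = i                      # a run of ink begins
        elif value <= threshold and start is not None:
            runs.append((start, i))        # the run ends at a blank position
            start = None
    if start is not None:                  # run reaches the image border
        runs.append((start, len(profile)))
    return runs


def segment_lines(image: Image) -> List[Tuple[int, int]]:
    """Split a page into text lines using the horizontal projection profile."""
    row_profile = [sum(row) for row in image]
    return segment_runs(row_profile)


def segment_words(image: Image, line: Tuple[int, int]) -> List[Tuple[int, int]]:
    """Split one text line into words using the vertical projection profile."""
    top, bottom = line
    width = len(image[0])
    col_profile = [sum(image[r][c] for r in range(top, bottom)) for c in range(width)]
    return segment_runs(col_profile)


# Toy 5x7 page: one line with two words, then a second line with one word.
page = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0],
]

lines = segment_lines(page)             # [(1, 3), (4, 5)]
words = segment_words(page, lines[0])   # [(0, 3), (4, 6)]
```

In printed Devanagari the Shirorekha (headline) joins the characters of a word, so blank columns in the vertical profile tend to fall between words rather than between characters. Separating the characters themselves would require removing the Shirorekha first, which this sketch does not attempt.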
Cited by
10 articles.