Affiliation:
1. Information Science Research Institute, University of Nevada, Las Vegas, USA
Abstract
This paper describes a new automatic spelling correction program for OCR-generated errors. The method is based on three principles:
1. Approximate string matching between the misspellings and the terms occurring in the database, as opposed to the entire dictionary.
2. Local information obtained from the individual documents.
3. The use of a confusion matrix, which contains information specific to the nature of the errors caused by the particular OCR device.
The system was then used to process approximately 10,000 pages of OCR-generated documents. Among the misspellings discovered by this algorithm, about 87% were corrected.
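The three principles in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the confusion costs, the cost threshold, and the function names `weighted_distance` and `correct` are all assumptions chosen for illustration. Candidates come from the collection's own terms rather than a full dictionary, substitutions the OCR device is assumed to make often are cheaper, and ties are broken by how often a term appears in the current document.

```python
from collections import Counter

# Hypothetical confusion costs: (intended_char, ocr_char) -> substitution cost.
# A real table would be estimated from the errors of a particular OCR device.
CONFUSION_COST = {("l", "1"): 0.2, ("o", "0"): 0.2, ("e", "c"): 0.4}


def weighted_distance(ocr, term, confusion=CONFUSION_COST):
    """Edit distance in which known OCR confusions are cheap substitutions."""
    prev = [float(j) for j in range(len(term) + 1)]
    for i, a in enumerate(ocr, 1):
        curr = [float(i)]
        for j, b in enumerate(term, 1):
            # Unlisted substitutions cost 1.0, like insertions and deletions.
            sub = 0.0 if a == b else confusion.get((b, a), 1.0)
            curr.append(min(prev[j] + 1.0,        # delete a from the OCR word
                            curr[j - 1] + 1.0,    # insert b into the OCR word
                            prev[j - 1] + sub))   # match or substitute
        prev = curr
    return prev[-1]


def correct(word, collection_terms, doc_counts, max_cost=1.0):
    """Return the best collection term within max_cost, else the word itself."""
    best = None
    for term in collection_terms:
        cost = weighted_distance(word, term)
        if cost > max_cost:
            continue
        # Lower cost first, then higher frequency in the current document.
        key = (cost, -doc_counts.get(term, 0), term)
        if best is None or key < best[0]:
            best = (key, term)
    return best[1] if best else word


doc_counts = Counter({"clean": 4, "clear": 1})
print(correct("c1ean", ["clean", "clear", "ocean"], doc_counts))
```

Restricting candidates to `collection_terms` keeps the search small and biased toward vocabulary that actually occurs in the database, which is the motivation the abstract gives for not matching against an entire dictionary.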
Publisher
World Scientific Pub Co Pte Lt
Subject
Artificial Intelligence, Computer Vision and Pattern Recognition, Software
Cited by
5 articles.
1. A Contrastive Study on Linguistic Features between HT and MT based on NLPIR-ICTCLAS;2021 5th International Conference on Natural Language Processing and Information Retrieval (NLPIR);2021-12-17
2. Using the Google Web 1T 5-Gram Corpus for OCR Error Correction;16th International Conference on Information Technology-New Generations (ITNG 2019);2019
3. Aligning Ground Truth Text with OCR Degraded Text;Advances in Intelligent Systems and Computing;2019
4. Incorporating linguistic post-processing into whole-book recognition;Document Recognition and Retrieval XVII;2010-01-17
5. Autotag: A tool for creating structured document collections from printed materials;Electronic Publishing, Artistic Imaging, and Digital Typography;1998