Abstract
We present an efficient and effective approach to training OCR engines using the Aletheia document analysis system. All components required for training are seamlessly integrated into Aletheia: training data preparation, the OCR engine’s training processes themselves, text recognition, and quantitative evaluation of the trained engine. Such a comprehensive training and evaluation system, guided through a GUI, allows for iterative, incremental training to achieve the best results. The widely used Tesseract OCR engine serves as a case study to demonstrate the efficiency and effectiveness of the proposed approach. Experimental results validate the training approach on two different historical datasets, representative of recent significant digitisation projects. The impact of different training strategies and training data requirements is presented in detail.
Publisher
Springer Science and Business Media LLC
Subject
Computer Science Applications, Computer Vision and Pattern Recognition, Software
Cited by
14 articles.