Affiliation:
1. CEDAR and Department of Computer Science, State University of New York at Buffalo, Buffalo, NY 14228–2567, USA
Abstract
Automatic analysis of images of forms is a problem of both practical and theoretical interest: practical because of its importance in office automation, and theoretical because of the conceptual challenges it poses for document image analysis. We describe an approach to the extraction of text, both typed and handwritten, from scanned and digitized images of filled-out forms. The method decomposes a filled-out form into three basic components (boxes, line segments, and the remainder: handwritten and typed characters, words, and logos) without using a priori knowledge of form structure. The input binary image is first segmented into small and large connected components. Complex boxes are decomposed into elementary regions using an approach based on key-point analysis. Handwritten and machine-printed text that touches or overlaps guide lines and boxes is separated by removing the lines. Characters broken by line removal are rejoined using a character patching method. Experimental results are given for filled-out forms from several different domains (insurance, banking, tax, retail, and postal).
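The first step named in the abstract, segmenting the binary image into small and large connected components, can be illustrated with a minimal sketch. This is not the authors' implementation: the 8-connectivity, the bounding-box size criterion, the parameter max_small_side, and all function names are assumptions chosen for illustration, since the abstract does not specify them.

```python
from collections import deque


def connected_components(image):
    """Return the 8-connected groups of foreground (1) pixels as pixel lists."""
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def bounding_box(pixels):
    """Return (top, left, bottom, right) of a pixel list, inclusive."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    return min(ys), min(xs), max(ys), max(xs)


def split_by_size(components, max_small_side):
    """Split components by the longer bounding-box side (assumed criterion)."""
    small, large = [], []
    for pixels in components:
        top, left, bottom, right = bounding_box(pixels)
        height = bottom - top + 1
        width = right - left + 1
        if max(height, width) <= max_small_side:
            small.append(pixels)   # candidate characters or marks
        else:
            large.append(pixels)   # candidate lines or boxes
    return small, large


# Toy form: a long horizontal rule in row 0 and an isolated 2x2 blob.
FORM = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0],
]
small, large = split_by_size(connected_components(FORM), max_small_side=3)
```

On the toy image, the horizontal rule falls in the large group and the blob in the small group. In a real form, text that touches a line would be merged into the same large component, which is presumably why the abstract goes on to describe line removal and character patching as separate steps.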
Publisher
World Scientific Pub Co Pte Lt
Subject
Artificial Intelligence, Computer Vision and Pattern Recognition, Software
Cited by
9 articles.
1. Intervention of light convolutional neural network in document survey form processing;Multimedia Tools and Applications;2023-07-01
2. Text and non-text separation in offline document images: a survey;International Journal on Document Analysis and Recognition (IJDAR);2018-03-08
3. Text and Graphics Analysis in Engineering Drawings;Analysis of Engineering Drawings and Raster Map Images;2013-07-13
4. Introduction;Analysis of Engineering Drawings and Raster Map Images;2013-07-13
5. Use of colour for hand-filled form analysis and recognition;Pattern Analysis and Applications;2005-07-22