Affiliation:
1. AT&T Bell Laboratories, 600 Mountain Avenue, Room 2C-322, Murray Hill, New Jersey 07974–0636, USA
Abstract
A method for analyzing the structure of the white background in document images is described, along with applications to the problem of isolating blocks of machine-printed text. The approach is based on computational-geometry algorithms for off-line enumeration of maximal white rectangles and on-line rectangle unification. These support a fast, simple, and general heuristic for geometric layout segmentation, in which white space is covered greedily by rectangles until all text blocks are isolated. Design of the heuristic can be substantially automated by an analysis of the empirical statistical distribution of properties of covering rectangles: for example, the stopping rule can be chosen by Rosenblatt’s perceptron training algorithm. Experimental trials show good behavior on the large and useful class of textual Manhattan layouts. On complex layouts from English-language technical journals of many publishers, the method finds good segmentations in a uniform and nearly parameter-free manner. On a variety of non-Latin texts, some with vertical text lines, the method finds good segmentations without prior knowledge of page and text-line orientation.
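The greedy white-space covering idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation, and it makes several assumptions. The page is a small 0/1 grid (1 = ink). Maximal white rectangles are found by brute force rather than by the computational-geometry enumeration the abstract refers to. Rectangles are ranked by plain area. The stopping rule is a fixed hypothetical `min_area` threshold standing in for a perceptron-trained rule. All function names are illustrative.

```python
def maximal_white_rectangles(grid):
    """Brute-force list of maximal all-white rectangles as half-open (r0, c0, r1, c1)."""
    h, w = len(grid), len(grid[0])
    # 2-D prefix sums give O(1) ink counts for any rectangle.
    ps = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        for c in range(w):
            ps[r + 1][c + 1] = grid[r][c] + ps[r][c + 1] + ps[r + 1][c] - ps[r][c]

    def ink(r0, c0, r1, c1):
        return ps[r1][c1] - ps[r0][c1] - ps[r1][c0] + ps[r0][c0]

    rects = []
    for r0 in range(h):
        for r1 in range(r0 + 1, h + 1):
            for c0 in range(w):
                for c1 in range(c0 + 1, w + 1):
                    if ink(r0, c0, r1, c1):
                        continue  # contains ink, so not white
                    # Maximal means it cannot grow by one cell in any direction.
                    if r0 > 0 and not ink(r0 - 1, c0, r1, c1):
                        continue
                    if r1 < h and not ink(r0, c0, r1 + 1, c1):
                        continue
                    if c0 > 0 and not ink(r0, c0 - 1, r1, c1):
                        continue
                    if c1 < w and not ink(r0, c0, r1, c1 + 1):
                        continue
                    rects.append((r0, c0, r1, c1))
    return rects


def greedy_white_cover(grid, min_area=2):
    """Cover white space with the largest rectangles first (assumed ranking: area)."""
    def area(t):
        return (t[2] - t[0]) * (t[3] - t[1])

    h, w = len(grid), len(grid[0])
    covered = [[False] * w for _ in range(h)]
    for rect in sorted(maximal_white_rectangles(grid), key=area, reverse=True):
        if area(rect) < min_area:  # assumed stopping rule, not the trained one
            break
        r0, c0, r1, c1 = rect
        for r in range(r0, r1):
            for c in range(c0, c1):
                covered[r][c] = True
    return covered


def text_blocks(covered):
    """Bounding boxes of 4-connected components of the uncovered cells."""
    h, w = len(covered), len(covered[0])
    seen = [[False] * w for _ in range(h)]
    blocks = []
    for r in range(h):
        for c in range(w):
            if covered[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            stack = [(r, c)]
            r0, c0, r1, c1 = r, c, r + 1, c + 1
            while stack:
                y, x = stack.pop()
                r0, c0 = min(r0, y), min(c0, x)
                r1, c1 = max(r1, y + 1), max(c1, x + 1)
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if 0 <= ny < h and 0 <= nx < w and not covered[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            blocks.append((r0, c0, r1, c1))
    return sorted(blocks)
```

On a toy page with two inked regions separated by a two-column gutter, `text_blocks(greedy_white_cover(grid))` returns one bounding box per region. The brute-force enumeration is only practical for tiny grids; a real system would need the faster enumeration the abstract describes.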
Publisher
World Scientific Pub Co Pte Ltd
Subject
Artificial Intelligence, Computer Vision and Pattern Recognition, Software
Cited by
8 articles.