Author:
Gemelli Andrea,Marinai Simone,Pisaneschi Lorenzo,Santoni Francesco
Abstract
AbstractFor a long time now, datasets containing scientific articles have been crucial to the analysis and recognition of document images. These document collections have frequently served as a testing ground for cutting-edge methods for optical character recognition, layout analysis, and document understanding in general. We thoroughly analyze and compare many datasets proposed for layout analysis of scientific documents, ranging from small collections of scanned papers to modern large-scale datasets containing digital-born papers, which have been proposed to train deep learning-based methods. Furthermore, we outline a detailed taxonomy of the annotation procedures used considering manual, automatic, and generative approaches, and we analyze their benefits and drawbacks. This survey is meant to provide the reader with a review of the most used benchmarks together with detailed information on data, annotations, and complexity, helping scholars to identify the most suitable dataset for their tasks of interest. We also discuss possible open problems to further enhance datasets to support research in the layout analysis of scientific articles.
Funder
Università degli Studi di Firenze
Publisher
Springer Science and Business Media LLC
Reference113 articles.
1. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009). https://doi.org/10.1109/CVPR.2009.5206848
2. Grother, P.J.: NIST special database 19. Handprinted forms and characters database, National Institute of Standards and Technology 10 (1995)
3. Deng, L.: The MNIST database of handwritten digit images for machine learning research. IEEE Signal Process. Mag. 29(6), 141–142 (2012)
4. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
5. Studies in Computational Intelligence;S Marinai,2008