Affiliation:
1. School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China
2. Hithink RoyalFlush Information Network Co., Ltd., Hangzhou, China
Abstract
The authors propose using formatting templates and implicit formatting semantics for automatic metadata identification and segmentation. The plain text and its corresponding formatting information, including line height, font type, and font size, are recognized in parallel to guide metadata identification. The authors use implicit formatting semantics, such as changes of formatting, formatting templates and implications, and explicit formatting layouts, together with a predefined database of frequently occurring keywords, to increase extraction accuracy. Unlike OCR-based approaches, the authors use the open-source PDFBox package as the basic preprocessing tool to obtain the plain text and formatting values of the document contents. On top of PDFBox they built their own pipeline program, PAXAT, to implement their approach to metadata extraction. A total of 10,177 papers from arXiv, ACM, ACL, and other publicly accessible and institution-subscribed sources were tested. The overall extraction accuracies for title, authors, affiliations, and author-affiliation matching are 0.9798, 0.9425, 0.9298, and 0.9109, respectively.
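The idea of using a change of formatting as a segmentation cue can be illustrated with a minimal Python sketch. This is not PAXAT's implementation: the `Line` record, the `segment_by_format` and `guess_title` helpers, and the sample values are all hypothetical, and the sketch assumes that a preprocessing step has already produced each line's text together with its font type, font size, and line height. Picking the block with the largest font as the title is a simple heuristic used here only for illustration.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Line:
    """Hypothetical record: one text line plus its formatting values."""
    text: str
    font: str      # font type
    size: float    # font size in points
    height: float  # line height in points


def segment_by_format(lines):
    """Group consecutive lines that share font type, size, and line height.

    A change in any of the three values starts a new block.
    """
    blocks = []
    for (font, size, height), group in groupby(
        lines, key=lambda l: (l.font, l.size, l.height)
    ):
        text = " ".join(l.text for l in group)
        blocks.append({"font": font, "size": size, "height": height, "text": text})
    return blocks


def guess_title(blocks):
    """Heuristic (an assumption): the block with the largest font is the title."""
    return max(blocks, key=lambda b: b["size"])["text"]


# Made-up sample lines standing in for the first page of a paper.
SAMPLE = [
    Line("Metadata Extraction from", "Times-Bold", 18.0, 22.0),
    Line("Scientific PDF Documents", "Times-Bold", 18.0, 22.0),
    Line("A. Author, B. Author", "Times-Roman", 12.0, 14.0),
    Line("Example University", "Times-Italic", 10.0, 12.0),
]

if __name__ == "__main__":
    blocks = segment_by_format(SAMPLE)
    print(guess_title(blocks))
```

A real pipeline would need more than this single cue; the abstract also mentions formatting templates, explicit layouts, and a keyword database, none of which are modeled here.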
Subject
Hardware and Architecture, Information Systems, Software
Cited by
7 articles.