Implicit Semantics Based Metadata Extraction and Matching of Scholarly Documents

Published:2018-04 Issue:2 Volume:29 Page:1-22
ISSN:1063-8016
Container-title:Journal of Database Management
language:en

Author:

Jiang Congfeng¹,Liu Junming¹,Ou Dongyang¹,Wang Yumei¹,Yu Lifeng²

Affiliation:

1. School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China

2. Hithink RoyalFlush Information Network Co., Ltd., Hangzhou, China

Abstract

The authors propose using formatting templates and implicit formatting semantics for automatic metadata identification and segmentation. The plain text and its corresponding formatting information, including line height, font type, and font size, are recognized in parallel to guide metadata identification. To increase extraction accuracy, the authors use implicit formatting semantics, such as changes of formatting, formatting templates and their implications, and explicit formatting layouts, together with a predefined database of frequently occurring keywords. Unlike OCR-based approaches, the authors use the open-source PDFBox package as the basic preprocessing tool to obtain the plain text and formatting values of the document contents. On top of PDFBox, they built their own pipeline program, PAXAT, to implement their approach to metadata extraction. They tested 10,177 papers from arXiv, ACM, ACL, and other publicly accessible and institution-subscribed sources. The overall extraction accuracies for title, authors, affiliations, and author-affiliation matching are 0.9798, 0.9425, 0.9298, and 0.9109, respectively.
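To make the idea in the abstract concrete, the following is a minimal sketch of format-driven segmentation. It is not PAXAT's actual algorithm. It assumes each line of the first page has already been extracted together with its font name and font size (as a tool such as PDFBox could supply), and every name in it (`Line`, `segment_by_format_change`, `guess_title`, `looks_like_affiliation`, `AFFILIATION_KEYWORDS`, and the sample values) is hypothetical. Consecutive lines with identical formatting are grouped into blocks, the block with the largest font is taken as the title, and a small keyword list stands in for the keyword database.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    """One extracted text line plus its formatting (hypothetical structure)."""
    text: str
    font_name: str
    font_size: float


# Hypothetical stand-in for a database of frequently occurring keywords.
AFFILIATION_KEYWORDS = ("university", "school", "institute", "department", "laboratory")


def segment_by_format_change(lines: List[Line]) -> List[List[Line]]:
    """Start a new block whenever the font name or font size changes."""
    blocks: List[List[Line]] = []
    for line in lines:
        if blocks:
            last = blocks[-1][-1]
            if (last.font_name, last.font_size) == (line.font_name, line.font_size):
                blocks[-1].append(line)
                continue
        blocks.append([line])
    return blocks


def guess_title(lines: List[Line]) -> str:
    """Return the text of the block with the largest font size (first one on ties)."""
    blocks = segment_by_format_change(lines)
    if not blocks:
        return ""
    best = max(blocks, key=lambda block: block[0].font_size)
    return " ".join(line.text.strip() for line in best)


def looks_like_affiliation(text: str) -> bool:
    """Keyword check: does the text contain any affiliation keyword?"""
    lowered = text.lower()
    return any(keyword in lowered for keyword in AFFILIATION_KEYWORDS)


# Illustrative input only; the font names and sizes are made up.
sample = [
    Line("Implicit Semantics Based Metadata Extraction", "Times-Bold", 18.0),
    Line("and Matching of Scholarly Documents", "Times-Bold", 18.0),
    Line("Congfeng Jiang, Junming Liu", "Times-Roman", 11.0),
    Line("Hangzhou Dianzi University, Hangzhou, China", "Times-Italic", 9.0),
    Line("ABSTRACT", "Times-Bold", 10.0),
]

if __name__ == "__main__":
    print(guess_title(sample))
    for block in segment_by_format_change(sample):
        text = " ".join(line.text for line in block)
        print(looks_like_affiliation(text), text)
```

Grouping by formatting change before classifying keeps multi-line titles together, which a per-line rule based only on the largest font would also do but without giving usable blocks for the author and affiliation lines.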

Publisher

IGI Global

Subject

Hardware and Architecture,Information Systems,Software

References: 49 articles.

1. Beel, J., Gipp, B., Shaker, A., & Friedrich, N. (2010). Sciplore xtract: Extracting titles from scientific pdf documents by analyzing style information (font size). In M. Lalmas et al. (Eds.), Proceedings of The European Conference on Digital Libraries, Lecture Notes in Computer Science (pp. 413-416). Springer-Verlag.

2. Docear's PDF inspector

3. Metadata for digital libraries: Architecture and design rationale;C.-C. K. Chang;Proceedings of the 2nd ACM International Conference on Digital Libraries;1997

4. Automatic Extraction of Figures from Scholarly Documents

Cited by 7 articles.

1. Text Detection Model for Historical Documents Using CNN and MSER;Journal of Database Management;2023-04-21

2. An overview of cluster-based image search result organization: background, techniques, and ongoing challenges;Knowledge and Information Systems;2022-02-11

3. An Ontological Framework for Information Extraction From Diverse Scientific Sources;IEEE Access;2021

4. Research on Methodology of Correlation Analysis of Sci-Tech Literature Based on Deep Learning Technology in the Big Data;Deep Learning and Neural Networks;2020

5. Collaboration Matrix Factorization on Rate and Review for Recommendation;Journal of Database Management;2019-04