Figure and caption extraction from biomedical documents

Published:2019-04-05 Issue:21 Volume:35 Page:4381-4388
ISSN:1367-4803
Container-title:Bioinformatics
language:en

Author:

Li Pengyuan¹, Jiang Xiangying¹, Shatkay Hagit¹

Affiliation:

1. Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA

Abstract

Motivation: Figures and captions convey essential information in biomedical documents. As such, there is a growing interest in mining published biomedical figures and in utilizing their respective captions as a source of knowledge. Notably, an essential step underlying such mining is the extraction of figures and captions from publications. While several PDF parsing tools that extract information from such documents are publicly available, they attempt to identify images by analyzing the PDF encoding and structure and the complex graphical objects embedded within. As such, they often incorrectly identify figures and captions in scientific publications, whose structure is often non-trivial. The extraction of figures, captions and figure-caption pairs from biomedical publications is thus neither well-studied nor yet well-addressed.

Results: We introduce a new and effective system for figure and caption extraction, PDFigCapX. Unlike existing methods, we first separate text from graphical content, and then utilize layout information to detect and extract figures and captions. We generate files containing the figures and their associated captions and provide those as output to the end-user. We test our system both over a public dataset of computer science documents previously used by others, and over two newly collected sets of publications focusing on the biomedical domain. Our experiments and results comparing PDFigCapX to other state-of-the-art systems show a significant improvement in performance, and demonstrate the effectiveness and robustness of our approach.

Availability and implementation: Our system is publicly available for use at: https://www.eecis.udel.edu/~compbio/PDFigCapX. The two new datasets are available at: https://www.eecis.udel.edu/~compbio/PDFigCapX/Downloads

Funder

National Institutes of Health

National Library of Medicine

Publisher

Oxford University Press (OUP)

Subject

Computational Mathematics, Computational Theory and Mathematics, Computer Science Applications, Molecular Biology, Biochemistry, Statistics and Probability

Link

http://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btz228/28756676/btz228.pdf

References: 40 articles.

1. Mining biomedical images towards valuable information retrieval in biomedical and life sciences;Ahmed;Database,2016

2. Dynamic expression pattern of leucine-rich repeat neuronal protein 4 in the mouse dorsal root ganglia during development;Bando;Neurosci. Lett,2013

3. Text and non-text separation in offline document images: a survey;Bhowmik;IJDAR,2018

4. The Mouse Genome Database (MGD): premier model organism resource for mammalian genomics and genetics;Blake;Nucleic Acids Res,2011

Cited by 25 articles.

1. Datasets and annotations for layout analysis of scientific articles;International Journal on Document Analysis and Recognition (IJDAR);2024-03-18

2. MouseScholar: Evaluating an Image+Text Search System for Biocuration;2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM);2023-12-05

3. Drug discovery for COVID-19 and related mutations using artificial intelligence;Research Journal of Pharmacy and Technology;2023-11-30

4. Automated scholarly paper review: Concepts, technologies, and challenges;Information Fusion;2023-10

5. An automatic system for extracting figure-caption pair from medical documents: a six-fold approach;PeerJ Computer Science;2023-07-26