Parsing of Research Documents into XML Using Formal Grammars

Published:2024-01 Issue:1 Volume:2024
ISSN:1687-9724
Container-title:Applied Computational Intelligence and Soft Computing
language:en
Short-container-title:Applied Computational Intelligence and Soft Computing

Author:

Iwashokun Opeoluwa, Ade-Ibijola Abejide

Abstract

Automatically extracting content and style formatting from paged documents is challenging. It requires converting the original document to a granular level of detail at which every document section and content item is identifiable. No such functionality or tool yet exists for academic research documents. In this paper, we present an automated process for parsing research paper documents into XML files using a formal-methods approach based on context-free grammars (CFGs) and regular expressions (REGEXs) definable for a standard template. We built a tool, the research_XML (RX) parser, that implements the algorithms and parses these documents into tree-like structures organized as XML files. The RX tool extracts the syntactic structure and semantic information of a document's contents into XML files. These XML output files are lightweight, analyzable, queryable, and web interoperable. The RX tool achieved a success rate of 91% when evaluated on fifty varied research documents averaging 160 pages each (8,004 pages in total). The tool and test data are available in a GitHub repository. The novelty of our process lies in applying formal techniques to information extraction from structured multipage documents, specifically academic research documents, thereby advancing research in automatic information extraction.
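To make the general idea concrete, here is a minimal sketch of turning text into a queryable XML tree with one regular expression and a toy grammar. It is not the RX parser: the grammar (`document -> section*`, `section -> HEADING PARA*`), the `HEADING` pattern, the element and attribute names, and the sample text are all assumptions chosen for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical token pattern: a numbered heading such as "1 Introduction" or "2.1 Method".
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(\S.*)$")


def tokenize(text):
    """Yield one (kind, value) token per non-blank line."""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = HEADING.match(line)
        if match:
            yield ("HEADING", (match.group(1), match.group(2)))
        else:
            yield ("PARA", line)


def parse(text):
    """Toy grammar: document -> section* ; section -> HEADING PARA*"""
    root = ET.Element("document")
    section = None
    for kind, value in tokenize(text):
        if kind == "HEADING":
            number, title = value
            section = ET.SubElement(root, "section", number=number, title=title)
        else:
            if section is None:
                # Text before the first heading goes into an assumed front-matter section.
                section = ET.SubElement(root, "section", number="0", title="front matter")
            ET.SubElement(section, "paragraph").text = value
    return root


# Illustrative input only; real research documents need far richer rules.
sample = """1 Introduction
Parsing paged documents is hard.
2 Method
We use grammars.
We also use regular expressions.
"""

tree = parse(sample)
xml_text = ET.tostring(tree, encoding="unicode")
print(xml_text)
```

Because the output is ordinary XML, it can be queried with path expressions, for example `tree.find("section[@number='2']")` to retrieve the second section. A real template would need many more productions (title page, abstract, figures, references) and would have to handle body lines that happen to start with a number.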

Funder

National Research Foundation

Publisher

Wiley

Link

http://downloads.hindawi.com/journals/acisc/2024/6671359.pdf

References: 61 articles.

1. Maji S., Appe A., Bali R., Chowdhury A. G., Raghavendra V. C., and Bhandaru V. M., An interpretable deep learning system for automatically scoring request for proposals, Proceedings of the 2021 IEEE 33rd International Conference on Tools with Artificial Intelligence (ICTAI), November 2021, Washington, DC, USA, IEEE, 851–855.

2. Palm R. B., Laws F., and Winther O., Attend copy parse end-to-end information extraction from documents, Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), September 2019, Sydney, NSW, Australia, 329–336, https://doi.org/10.1109/ICDAR.2019.00060.

3. Graliński F., Stanisławek T., Wróblewska A., Lipiński D., Kaliska A., Rosalska P., Topolski B., and Biecek P., Kleister: a novel task for information extraction involving long documents with complex layout, 2020, https://arxiv.org/abs/2003.02356.

4. Procedure Parsing: A Method for Parsing Handwritten Documents into Computer-Based Procedures

5. Zaman G., Information extraction from semi and unstructured data sources: a systematic literature review, ICIC Express Letters, 2020.