Auto-CORPus: A Natural Language Processing Tool for Standardising and Reusing Biomedical Literature

Published: 2021-01-08

Author:

Beck Tim, Shorter Tom, Hu Yan, Li Zhuoyu, Sun Shujian, Popovici Casiana M., McQuibban Nicholas A. R., Makraduli Filip, Yeung Cheng S., Rowlands Thomas, Posma Joram M.

Abstract

To analyse large corpora using machine learning and other Natural Language Processing (NLP) algorithms, the corpora need to be standardised. The BioC format is a simple, community-driven data structure for sharing text and annotations; however, access to biomedical literature in BioC format is limited, and there is a lack of bioinformatics tools to convert online publication HTML formats to BioC. We present Auto-CORPus (Automated pipeline for Consistent Outputs from Research Publications), a novel NLP tool for the standardisation and conversion of publication HTML and table image files to three convenient machine-interpretable outputs to support biomedical text analytics. Firstly, Auto-CORPus can be configured to convert HTML from various publication sources to BioC. To standardise the description of heterogeneous publication sections, the Information Artifact Ontology is used to annotate each section within the BioC output. Secondly, Auto-CORPus transforms publication tables to a JSON format to store, exchange and annotate table data between text analytics systems. The BioC specification does not include a data structure for representing publication table data, so we present a JSON format for sharing table content and metadata. Inline tables within full-text HTML files and linked tables within separate HTML files are processed and converted to a machine-interpretable table JSON format. Finally, Auto-CORPus extracts abbreviations declared within publication text and provides an abbreviations JSON output that relates each abbreviation to its full definition. This abbreviation collection supports text mining tasks such as named entity recognition by including abbreviations unique to individual publications that are not contained within standard bio-ontologies and dictionaries.

Availability: The Auto-CORPus package is freely available, with detailed instructions, from GitHub at https://github.com/omicsNLP/Auto-CORPus/.
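
As a rough illustration of the abbreviation output described in the abstract, the sketch below pairs each parenthesised abbreviation with the words that precede it and serialises the result as JSON. It is not the Auto-CORPus implementation: the function name `extract_abbreviations`, the initial-letter matching heuristic, and the flat JSON layout are all assumptions made for this example, and the tool's actual algorithm and output schema may differ.

```python
import json
import re

# Hypothetical pattern: a parenthesised token that starts with a capital
# letter and is 2 to 10 characters long, e.g. "(NLP)".
ABBREVIATION = re.compile(r"\(([A-Z][A-Za-z0-9-]{1,9})\)")


def extract_abbreviations(text):
    """Return {abbreviation: definition} for 'definition (ABBR)' patterns.

    Assumed heuristic: the definition is the run of words directly before
    the parenthesis whose initials spell the abbreviation.
    """
    found = {}
    for match in ABBREVIATION.finditer(text):
        abbreviation = match.group(1)
        letters = [c for c in abbreviation if c.isalnum()]
        preceding_words = text[:match.start()].split()
        candidate = preceding_words[-len(letters):]
        if len(candidate) != len(letters):
            continue  # not enough words before the parenthesis
        initials_match = all(
            word[0].lower() == letter.lower()
            for word, letter in zip(candidate, letters)
        )
        if initials_match:
            found[abbreviation] = " ".join(candidate)
    return found


if __name__ == "__main__":
    sample = (
        "Corpora are analysed with Natural Language Processing (NLP) "
        "algorithms, for tasks such as named entity recognition (NER)."
    )
    # Serialise the mapping as JSON so other tools can consume it.
    print(json.dumps(extract_abbreviations(sample), indent=2))
```

A simple initial-letter heuristic like this misses definitions that do not map one word to one letter (for example "Auto-CORPus" itself), which is one reason a dedicated tool is useful for this task.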

Publisher

Cold Spring Harbor Laboratory

Cited by 2 articles.

1. An automatic system for extracting figure-caption pair from medical documents: a six-fold approach. PeerJ Computer Science, 2023-07-26.

2. A survey on clinical natural language processing in the United Kingdom from 2007 to 2022. npj Digital Medicine, 2022-12-21.