Biomedical literature classification using encyclopedic knowledge: a Wikipedia-based bag-of-concepts approach

Published: 2015-09-29 Volume: 3 Page: e1279
ISSN: 2167-8359
Container-title: PeerJ
Language: en

Author:

Mouriño García Marcos Antonio¹, Pérez Rodríguez Roberto¹, Anido Rifón Luis E.¹

Affiliation:

1. Department of Telematics Engineering, University of Vigo, Vigo, Spain

Abstract

Automatic classification of text documents into a set of categories has many applications. Among them, the automatic classification of biomedical literature stands out as an important one. Biomedical staff and researchers have to deal with a large volume of literature in their daily activities, so a system that provides simple and effective access to documents of interest would be useful; this requires that the documents be sorted according to some criteria, that is, classified. Documents to classify are usually represented following the bag-of-words (BoW) paradigm. Features are the words in the text, which therefore suffer from synonymy and polysemy, and their weights are based only on their frequency of occurrence. This paper presents an empirical study of the efficiency of a classifier that leverages encyclopedic background knowledge, specifically Wikipedia, to create bag-of-concepts (BoC) representations of documents, where a concept is understood as a “unit of meaning”, thus tackling synonymy and polysemy. In addition, the weighting of concepts is based on their semantic relevance in the text. To evaluate the proposal, empirical experiments were conducted with OHSUMED, one of the corpora commonly used for evaluating classification and retrieval of biomedical information, and with UVigoMED, a purpose-built corpus of MEDLINE biomedical abstracts. The results show that the Wikipedia-based bag-of-concepts representation outperforms the classical bag-of-words representation by up to 157% in the single-label classification problem and up to 100% in the multi-label problem for the OHSUMED corpus, and by up to 122% in the single-label problem and up to 155% in the multi-label problem for the UVigoMED corpus.
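
The difference between the two representations described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `CONCEPT_MAP` dictionary is a hypothetical, hand-written stand-in for a Wikipedia-based annotator, the function names are illustrative, and concepts are weighted by raw counts rather than by the semantic relevance the paper uses. A plain dictionary lookup also only addresses synonymy, not polysemy.

```python
import re
from collections import Counter

# Hypothetical surface-form -> concept map; a toy stand-in for a
# Wikipedia-based annotator (an assumption, not part of the paper).
CONCEPT_MAP = {
    "heart attack": "Myocardial infarction",
    "myocardial infarction": "Myocardial infarction",
    "high blood pressure": "Hypertension",
    "hypertension": "Hypertension",
}


def bag_of_words(text):
    """Count lowercase word tokens (classical BoW with frequency weights)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def bag_of_concepts(text, concept_map):
    """Count concepts whose surface forms occur in the text.

    Synonymous surface forms collapse onto the same concept label.
    Weights here are raw counts, a simplification of relevance weighting.
    """
    lowered = text.lower()
    bag = Counter()
    for phrase, concept in concept_map.items():
        pattern = r"\b" + re.escape(phrase) + r"\b"
        hits = len(re.findall(pattern, lowered))
        if hits:
            bag[concept] += hits
    return bag


doc_a = "The patient suffered a heart attack."
doc_b = "Myocardial infarction was diagnosed."

# The two documents share no word tokens, yet map to the same concept.
shared_words = set(bag_of_words(doc_a)) & set(bag_of_words(doc_b))
boc_a = bag_of_concepts(doc_a, CONCEPT_MAP)
boc_b = bag_of_concepts(doc_b, CONCEPT_MAP)
```

In this toy case a classifier trained on BoW features sees two unrelated documents, while the BoC features are identical for both.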

Funder

Galician Regional Government

REDPLIR (Red Gallega de Procesamiento del Lenguaje y Recuperacion de Informacion)

Publisher

PeerJ

Subject

General Agricultural and Biological Sciences; General Biochemistry, Genetics and Molecular Biology; General Medicine; General Neuroscience

Link

https://peerj.com/articles/1279.pdf

References: 49 articles.

1. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program;Aronson;AMIA Annual Symposium Proceedings,2001

2. Latent Dirichlet Allocation;Blei;Journal of Machine Learning Research,2003

3. Multiset theory;Blizard;Notre Dame Journal of Formal Logic,1988

4. Boosting for text classification with semantic features;Bloehdorn,2004

5. The Unified Medical Language System (UMLS): integrating biomedical terminology;Bodenreider;Nucleic Acids Research,2004

Cited by 16 articles.

1. MeSH-Based Semantic Weighting Scheme to Enhance Document Indexing: Application on Biomedical Document Classification;Journal of Information & Knowledge Management;2024-03-21

2. Improved class-specific vector for biomedical question type classification;International Journal of Computational Science and Engineering;2023

3. Production, Economics, and Marketing of Yeast Single Cell Protein;Food Microbiology Based Entrepreneurship;2023

4. BioMDSE: A Multimodal Deep Learning-Based Search Engine Framework for Biofilm Documents Classifications;2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM);2022-12-06

5. Boosting biomedical document classification through the use of domain entity recognizers and semantic ontologies for document representation: The case of gluten bibliome;Neurocomputing;2021-11