Active Learning for Biomedical Article Classification with Bag of Words and FastText Embeddings

Author:

Cichosz Paweł 1

Affiliation:

1. Institute of Computer Science, Warsaw University of Technology, 00-665 Warsaw, Poland

Abstract

In several applications of text classification, training document labels are provided by human evaluators, so gathering sufficient data for model creation is time-consuming and costly. The labeling time and effort may be reduced by active learning, in which classification models are created from relatively small training sets obtained by collecting class labels in response to labeling requests, or queries. This is an iterative process in which a sequence of models is fitted, and each model is used to select the query articles to be added to the training set for the next one. Such a learning scenario may pose different challenges for the machine learning algorithms and text representation methods used for text classification than ordinary passive learning does, since they have to deal with very small, often imbalanced data, and the computational expense of both model creation and prediction has to remain low. This work examines how classification algorithms and text representation methods that prior work has found particularly useful handle these challenges. The random forest and support vector machine (SVM) algorithms are coupled with the bag of words and FastText word embedding representations and applied to datasets consisting of scientific article abstracts from systematic literature review studies in the biomedical domain. Several strategies are used to select articles for active learning queries, including uncertainty sampling, diversity sampling, and strategies favoring the minority class. Confidence-based and stability-based early stopping criteria are used to generate active learning termination signals. The results confirm that active learning is a useful approach to creating text classification models with limited access to labeled data, making it possible to save at least half of the human effort needed to assign relevant or irrelevant class labels to training articles. Two of the four examined combinations of classification algorithms and text representation methods were the most successful: the SVM algorithm with the FastText representation and the random forest algorithm with the bag of words representation. Uncertainty sampling turned out to be the most useful query selection strategy, and confidence-based stopping was found to be more universal and easier to configure than stability-based stopping.
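To make the query selection step more concrete, the following is a minimal sketch of uncertainty sampling for a binary relevant/irrelevant task, together with one possible confidence-based stopping check. It is not the paper's implementation: the function names `uncertainty_query` and `confident_enough`, the 0.9 threshold, and the toy probabilities are all assumptions made for illustration, and the abstract does not specify how the stopping criteria are actually defined.

```python
from typing import List, Sequence, Set


def uncertainty_query(
    relevance_probs: Sequence[float],
    labeled: Set[int],
    batch_size: int,
) -> List[int]:
    """Pick the unlabeled articles whose predicted P(relevant) is closest to 0.5."""
    # Only articles without a label are candidates for a query.
    candidates = [i for i in range(len(relevance_probs)) if i not in labeled]
    # Smaller distance from 0.5 means higher uncertainty; ties break by index.
    candidates.sort(key=lambda i: (abs(relevance_probs[i] - 0.5), i))
    return candidates[:batch_size]


def confident_enough(
    relevance_probs: Sequence[float],
    labeled: Set[int],
    threshold: float = 0.9,  # hypothetical value, not taken from the paper
) -> bool:
    """Assumed stopping rule: every remaining prediction is at least `threshold` confident."""
    remaining = [p for i, p in enumerate(relevance_probs) if i not in labeled]
    return all(max(p, 1.0 - p) >= threshold for p in remaining)


# Toy model output: predicted probability that each article is relevant.
probs = [0.95, 0.52, 0.10, 0.45, 0.70, 0.03]
labeled = {0}  # article 0 already has a human-provided label

query = uncertainty_query(probs, labeled, batch_size=2)
stop = confident_enough(probs, labeled)
print(query, stop)
```

In a full active learning loop, the queried articles would be labeled by a human, added to the training set, and the model refitted before the next round; the loop would end once the stopping check (or a labeling budget) signals termination.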

Publisher

MDPI AG

