Building a training dataset for classification under a cost limitation

Published:2021-02-24 Issue:1 Volume:39 Page:77-96
ISSN:0264-0473
Container-title:The Electronic Library
language:en
Short-container-title:EL

Author:

Chen Yen-Liang, Cheng Li-Chen, Zhang Yi-Jun

Abstract

Purpose: Document classification requires a preprocessing step in which some documents are labeled, so that a classifier can be built and then used to classify the remaining documents. Because documents differ in length and complexity, the cost of labeling differs from one document to another. This paper considers how to select a subset of documents for labeling under a limited budget, so that the total labeling cost does not exceed the budget while the resulting classifier achieves the best possible classification results.

Design/methodology/approach: A framework is proposed for selecting the instances to label; it integrates two clustering algorithms and two centroid selection methods. Five different classifiers were built from the selected and labeled instances, and their good classification accuracy is used to demonstrate the superiority of the selected instances.

Findings: Experimental results show that the method can build a training data set containing the most suitable data while respecting the cost constraint. The data set accounts for both “data representativeness” and “data selection cost,” so that the training data labeled by experts can effectively produce a classifier with high accuracy.

Originality/value: No previous research has considered how to build a training set under a cost limit when each document has a distinct labeling cost. This paper is the first attempt to resolve this issue.
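
The abstract does not say which clustering algorithms or centroid selection methods the paper uses, so the following is only a minimal sketch of the general idea (favor representative instances while keeping total labeling cost within the budget), not the authors' framework. It assumes cluster assignments are already available, and the function name `select_within_budget`, the greedy round-robin rule, and the toy data are all hypothetical choices made for illustration.

```python
import math


def select_within_budget(points, costs, cluster_ids, budget):
    """Greedy sketch (an assumption, not the paper's algorithm):
    visit clusters round-robin and, in each visit, take the affordable
    instance closest to that cluster's centroid."""
    # Group instance indices by cluster id.
    clusters = {}
    for idx, cid in enumerate(cluster_ids):
        clusters.setdefault(cid, []).append(idx)

    # Rank each cluster's members by distance to the cluster mean,
    # so the most "representative" instance comes first.
    ranked = {}
    for cid, members in clusters.items():
        dim = len(points[members[0]])
        centroid = [
            sum(points[i][d] for i in members) / len(members)
            for d in range(dim)
        ]
        ranked[cid] = sorted(
            members, key=lambda i: math.dist(points[i], centroid)
        )

    selected, spent = [], 0.0
    progress = True
    while progress:
        progress = False
        for cid in sorted(ranked):
            for i in ranked[cid]:
                # Take the first candidate whose cost still fits.
                if spent + costs[i] <= budget:
                    ranked[cid].remove(i)
                    selected.append(i)
                    spent += costs[i]
                    progress = True
                    break
    return selected, spent


# Toy data: two well-separated clusters with per-instance labeling costs.
points = [(0, 0), (0, 2), (1, 0), (10, 10), (10, 12), (11, 10)]
costs = [3, 1, 1, 2, 4, 4]
cluster_ids = [0, 0, 0, 1, 1, 1]

selected, spent = select_within_budget(points, costs, cluster_ids, budget=6)
print(selected, spent)
```

A greedy rule like this never exceeds the budget, but it makes no claim to optimality; the trade-off between "data representativeness" and "data selection cost" that the abstract mentions could be weighted in many other ways.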

Publisher

Emerald

Subject

Library and Information Sciences, Computer Science Applications

Reference 41 articles.

1. A survey of text classification algorithms;Mining Text Data,2012

2. Active learning: a survey,2014

3. A new hybrid semi-supervised algorithm for text classification with class-based semantics;Knowledge-Based Systems,2016

4. Semi-automatic data annotation guided by feature space projection;Pattern Recognition,2021

5. Efficient agglomerative hierarchical clustering;Expert Systems with Applications,2015

Cited by 2 articles.

1. Identification effect of least square fitting method in archives management;Heliyon;2023-09

2. Classifier Construction Under Budget Constraints;Proceedings of the 2022 International Conference on Management of Data;2022-06-10