Affiliation:
1. Italian National Council of Research
2. Qatar Computing Research Institute, Doha, Qatar
Abstract
Semiautomated Text Classification
(SATC) may be defined as the task of ranking a set
D
of automatically labelled textual documents in such a way that, if a human annotator validates (i.e., inspects and corrects where appropriate) the documents in a top-ranked portion of
D
with the goal of increasing the overall labelling accuracy of
D
, the expected increase is maximized. An obvious SATC strategy is to rank
D
so that the documents that the classifier has labelled with the lowest confidence are top ranked. In this work, we show that this strategy is suboptimal. We develop new utility-theoretic ranking methods based on the notion of
validation gain
, defined as the improvement in classification effectiveness that would derive by validating a given automatically labelled document. We also propose a new effectiveness measure for SATC-oriented ranking methods, based on the expected reduction in classification error brought about by partially validating a list generated by a given ranking method. We report the results of experiments showing that, with respect to the baseline method mentioned earlier, and according to the proposed measure, our utility-theoretic ranking methods can achieve substantially higher expected reductions in classification error.
Publisher
Association for Computing Machinery (ACM)
Cited by
5 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献