Abstract
Efficiently exploiting all sources of information, such as labeled instances, class representations, and the relations among classes, has a high impact on the performance of Multi-Label Text Classification (MLTC) systems. Most current approaches use labeled documents as the primary source of information for MLTC. We investigate the effectiveness of different sources of information (the labeled training data, the textual labels of classes, and the taxonomy relations of classes) for MLTC. More specifically, for each document–class pair, we first extract features from the different sources of information; these features reflect the similarity between the class and the document. We then treat MLTC as a ranking problem and use a learning to rank (LTR) approach to rank classes with respect to each document and to select the document's labels. An important characteristic of many MLTC instances is that documents can belong to multiple classes and that there are implicit relations between classes. We therefore apply score propagation on top of LTR to incorporate the co-occurrence patterns of classes in labeled documents. Our main findings are the following. First, using an LTR approach that integrates all features, we observe significantly better performance than previous MLTC systems. Specifically, we show that simple classification approaches fail when the number of classes is high. Second, the analysis of feature weights reveals the relative importance of the various sources of evidence and gives insight into the underlying classification problem. Interestingly, the results indicate that the titles of documents are more informative than all other sources of information. Third, a lean-and-mean system using only four features performs at 96% of the large LTR model that we propose in this paper. Fourth, using the co-occurrence information of classes helps to classify documents more accurately; our results show that this information is more helpful when the underlying classifier performs poorly.
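The pipeline described in the abstract (pairwise features, ranking, then score propagation) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the three cosine-similarity features, the fixed weights standing in for a trained LTR model, the single propagation step with mixing factor `alpha`, and the toy documents and class names.

```python
from collections import Counter
from math import sqrt

def bow(text):
    """Lowercased bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def pair_features(doc, cls):
    """Hypothetical features for one document-class pair."""
    return [
        cosine(bow(doc["title"]), bow(cls["label"])),       # title vs. class label text
        cosine(bow(doc["body"]), bow(cls["label"])),        # body vs. class label text
        cosine(bow(doc["title"]), bow(cls["train_text"])),  # title vs. labeled training text
    ]

def rank_classes(doc, classes, weights):
    """Linear scorer standing in for a learned LTR model (weights are assumed)."""
    return {
        name: sum(w * f for w, f in zip(weights, pair_features(doc, cls)))
        for name, cls in classes.items()
    }

def propagate(scores, cooc, alpha=0.3):
    """One illustrative propagation step: blend each class score with the
    co-occurrence-weighted average score of the classes it co-occurs with."""
    out = {}
    for name, score in scores.items():
        neighbors = cooc.get(name, {})
        total = sum(neighbors.values())
        if total == 0:
            out[name] = score  # no co-occurring classes: leave the score unchanged
            continue
        spread = sum(n * scores.get(other, 0.0) for other, n in neighbors.items()) / total
        out[name] = (1 - alpha) * score + alpha * spread
    return out

# Toy data, invented for this sketch.
classes = {
    "fisheries": {"label": "fisheries policy", "train_text": "fishing quota fleet catch"},
    "agriculture": {"label": "agricultural policy", "train_text": "farm subsidy crop"},
    "maritime": {"label": "maritime transport", "train_text": "ship port fleet"},
}
doc = {
    "title": "fishing quota regulation",
    "body": "the regulation sets catch limits under the fisheries policy",
}
# Co-occurrence counts of classes in labeled documents (assumed values).
cooc = {
    "fisheries": {"maritime": 3, "agriculture": 1},
    "maritime": {"fisheries": 3},
    "agriculture": {"fisheries": 1},
}

scores = rank_classes(doc, classes, weights=[1.0, 1.0, 1.0])
propagated = propagate(scores, cooc)
ranking = sorted(propagated, key=propagated.get, reverse=True)
```

In this toy setup, `maritime` shares no terms with the document and so receives no score from the features alone; it gains a score only through its co-occurrence with `fisheries`. That mirrors the abstract's point that co-occurrence information helps most where the underlying classifier is weak.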
Publisher
Cambridge University Press (CUP)
Subject
Artificial Intelligence, Linguistics and Language, Language and Linguistics, Software
Cited by
24 articles.