Author:
Chen Xingyuan, Xia Yunqing, Jin Peng, Carroll John
Abstract
Manually labeling documents for training a text classifier is expensive and time-consuming. Moreover, a classifier trained on labeled documents may suffer from overfitting and adaptability problems. Dataless text classification (DLTC) has been proposed as a solution to these problems, since it does not require labeled documents. Previous research in DLTC has used explicit semantic analysis of Wikipedia content to measure semantic distance between documents, which is in turn used to classify test documents based on nearest neighbours. The semantic-based DLTC method has a major drawback in that it relies on a large-scale, finely compiled semantic knowledge base, which is difficult to obtain in many scenarios. In this paper we propose a novel kind of model, descriptive LDA (DescLDA), which performs DLTC with only category description words and unlabeled documents. In DescLDA, the LDA model is assembled with a describing device to infer Dirichlet priors from prior descriptive documents created with category description words. The Dirichlet priors are then used by LDA to induce category-aware latent topics from unlabeled documents. Experimental results with the 20Newsgroups and RCV1 datasets show that: (1) our DLTC method is more effective than the semantic-based DLTC baseline method; and (2) the accuracy of our DLTC method is very close to that of state-of-the-art supervised text classification methods. As neither external knowledge resources nor labeled documents are required, our DLTC method is applicable to a wider range of scenarios.
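The core idea in the abstract, turning category description words into Dirichlet priors over topic-word distributions, can be illustrated with a small sketch. This is not the authors' DescLDA implementation: it skips the describing device and LDA inference entirely and only scores each document against the mean of a seeded prior. The function names, the `base` and `boost` parameters, and the toy data are all hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def build_topic_word_priors(category_words, vocabulary, base=0.01, boost=1.0):
    """Build one asymmetric Dirichlet pseudo-count vector per category.

    Every vocabulary word gets `base`; each occurrence of a word in the
    category description adds `boost`. (Assumed scheme, not from the paper.)
    """
    priors = {}
    for category, words in category_words.items():
        counts = Counter(words)
        priors[category] = {w: base + boost * counts[w] for w in vocabulary}
    return priors


def classify(tokens, priors):
    """Return the category whose prior-mean word distribution gives the
    tokens the highest log-likelihood. No topic inference is performed."""
    best_category, best_score = None, -math.inf
    for category, pseudo_counts in priors.items():
        total = sum(pseudo_counts.values())
        score = sum(
            math.log(pseudo_counts[t] / total)
            for t in tokens
            if t in pseudo_counts
        )
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Toy data: category description words and unlabeled, tokenized documents.
category_words = {
    "sports": ["game", "team", "score"],
    "space": ["orbit", "rocket", "launch"],
}
unlabeled = [["team", "won", "game"], ["rocket", "launch", "delayed"]]

vocabulary = sorted(
    {w for ws in category_words.values() for w in ws}
    | {w for doc in unlabeled for w in doc}
)
priors = build_topic_word_priors(category_words, vocabulary)
labels = [classify(doc, priors) for doc in unlabeled]
```

In a fuller pipeline, pseudo-counts like these would serve as the topic-word prior of an LDA sampler run over the unlabeled documents, so that the induced topics stay aligned with the categories; the sketch stops short of that step.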
Publisher
Association for the Advancement of Artificial Intelligence (AAAI)
Cited by
15 articles.
1. From Text to Context: An Entailment Approach for News Stakeholder Classification;Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval;2024-07-10
2. TRGNN: Text-Rich Graph Neural Network for Few-Shot Document Filtering;2024 International Joint Conference on Neural Networks (IJCNN);2024-06-30
3. RulePrompt: Weakly Supervised Text Classification with Prompting PLMs and Self-Iterative Logical Rules;Proceedings of the ACM Web Conference 2024;2024-05-13
4. Keyword-Based Feedback for Interactive Document Classification;2024 6th International Conference on Communications, Information System and Computer Engineering (CISCE);2024-05-10
5. Weak-PMLC: A large-scale framework for multi-label policy classification based on extremely weak supervision;Information Processing & Management;2023-09