Affiliation:
1. University of Liverpool, United Kingdom
2. RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
3. National Institute of Informatics, Tokyo, Japan
Abstract
Short and sparse texts such as tweets, search engine snippets, product reviews, and chat messages are abundant on the Web. Classifying such short texts into a pre-defined set of categories is a common problem that arises in various contexts, such as sentiment classification, spam detection, and information recommendation. The fundamental problem in short-text classification is feature sparseness: the lack of feature overlap between a trained model and a test instance to be classified. To overcome the feature sparseness problem, we propose ClassiNet, a network of classifiers trained to predict missing features in a given instance. Using a set of unlabeled training instances, we first learn binary classifiers as feature predictors, each of which predicts whether a particular feature occurs in a given instance. Next, each feature predictor is represented as a vertex $v_i$ in the ClassiNet, where a one-to-one correspondence exists between feature predictors and vertices. The weight of the directed edge $e_{ij}$ connecting a vertex $v_i$ to a vertex $v_j$ represents the conditional probability that $v_j$ occurs in an instance, given that $v_i$ occurs in the same instance.
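As a point of reference for the edge weights described above, the following minimal sketch estimates the conditional probability of one feature given another from explicit co-occurrence counts. This is the plain word co-occurrence graph, not the learned feature predictors of ClassiNet; the function name `cooccurrence_edge_weights` and the toy instances are assumptions made for illustration only.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_edge_weights(instances):
    """Estimate w[(i, j)] = P(j occurs | i occurs) from explicit counts.

    Illustrative baseline only: a ClassiNet would replace these counts
    with the outputs of trained binary feature predictors.
    """
    feature_count = Counter()  # number of instances containing feature i
    pair_count = Counter()     # number of instances containing both i and j
    for tokens in instances:
        features = set(tokens)  # presence, not frequency
        feature_count.update(features)
        pair_count.update(permutations(sorted(features), 2))
    return {(i, j): c / feature_count[i] for (i, j), c in pair_count.items()}


# Hypothetical toy instances (not from the paper's datasets).
docs = [["cheap", "flight", "deal"], ["cheap", "hotel"], ["flight", "delay"]]
weights = cooccurrence_edge_weights(docs)
```

With counts alone, a pair of features that never appears together in any instance receives no edge at all, which is exactly the sparseness that predicting features is meant to alleviate.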
We show that ClassiNets generalize word co-occurrence graphs by considering implicit co-occurrences between features. We extract numerous features from the trained ClassiNet to overcome feature sparseness. In particular, for a given instance $x$, we find similar features from the ClassiNet that did not appear in $x$ and append those features to the representation of $x$. Moreover, we propose a method based on graph propagation to find features that are indirectly related to a given short text. We evaluate ClassiNets on several benchmark datasets for short-text classification. Our experimental results show that using ClassiNet yields statistically significant improvements in accuracy on short-text classification tasks, without requiring any external resources such as thesauri for finding related features.
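The feature expansion and graph propagation steps can be illustrated with a small sketch. The abstract does not specify the propagation rule, so the scoring used here (multiplying edge weights along a path and summing over paths, up to a fixed number of hops) is an assumption, as are the function name `expand_features`, its parameters, and the toy edge weights.

```python
def expand_features(instance, weights, hops=2, top_k=3):
    """Rank features absent from `instance` by propagated activation.

    Assumed scoring rule: each observed feature starts with activation 1.0,
    activation is multiplied by the edge weight at every hop, and the
    contributions reaching a candidate feature are summed.
    """
    present = set(instance)
    frontier = {f: 1.0 for f in present}
    scores = {}
    for _ in range(hops):
        reached = {}
        for (i, j), w in weights.items():
            # Propagate only towards features missing from the instance.
            if i in frontier and j not in present:
                reached[j] = reached.get(j, 0.0) + frontier[i] * w
        for j, s in reached.items():
            scores[j] = scores.get(j, 0.0) + s
        frontier = reached
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


# Hypothetical edge weights, e.g. P(j occurs | i occurs).
toy_weights = {
    ("cheap", "flight"): 0.5,
    ("cheap", "hotel"): 0.5,
    ("flight", "delay"): 0.5,
}
expanded = expand_features(["cheap"], toy_weights, hops=2, top_k=3)
direct_only = expand_features(["cheap"], toy_weights, hops=1, top_k=3)
```

In this toy graph, a single hop reaches only the direct neighbours of "cheap", whereas a second hop also reaches "delay" through "flight", with a lower score that reflects the indirect relation.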
Funder
ERATO Kawarabayashi Large Graph Project from the Japan Science and Technology Agency
Publisher
Association for Computing Machinery (ACM)
Cited by
6 articles.