Affiliation:
1. University of California, Davis, USA
2. University of South Florida, USA
Abstract
Classification is a form of data analysis that can be used to extract models to predict categorical class labels (Han & Kamber, 2001). Data classification has proven to be very useful in a wide variety of applications. For example, a classification model can be built to categorize bank loan applications as either safe or risky. To build a classification model, training data containing multiple independent variables and a dependent variable (the class label) is needed. If a data record has a known value for its class label, the record is termed “labeled”. If the value for its class label is unknown, it is “unlabeled”. In many situations, a large amount of unlabeled data is available but only a small amount of labeled data. Using only labeled data to build classification models can potentially ignore useful information contained in the unlabeled data. Furthermore, unlabeled data can often be much cheaper and more plentiful than labeled data, so if useful information can be extracted from it that reduces the need for labeled examples, this can be a significant benefit (Balcan & Blum, 2005). The default practice is to use only the labeled data to build a classification model and then assign class labels to the unlabeled data. However, when the amount of labeled data is insufficient, a classification model built using only the labeled data can be biased and far from accurate, and the class labels it assigns to the unlabeled data can then be inaccurate. How to leverage the information contained in the unlabeled data to improve the accuracy of the classification model is an important research question. Two streams of research address the challenging issue of how to appropriately use unlabeled data for building classification models. The details are discussed below.
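The default practice described above can be illustrated with a minimal sketch. It assumes a hypothetical loan data set with two independent variables (income in thousands of dollars and debt ratio) and uses a simple nearest-centroid classifier as a stand-in for whatever classification model would be used in practice. The data values and the function names `fit_centroids` and `predict` are illustrative assumptions, not taken from the cited work.

```python
from statistics import mean

# Hypothetical toy data: (income in $1000s, debt ratio) -> class label.
# Only four records are labeled; the rest are unlabeled.
labeled = [
    ((85.0, 0.10), "safe"),
    ((72.0, 0.20), "safe"),
    ((30.0, 0.65), "risky"),
    ((25.0, 0.80), "risky"),
]
unlabeled = [(78.0, 0.15), (28.0, 0.70), (50.0, 0.45)]


def fit_centroids(records):
    """Compute one mean feature vector (centroid) per class label."""
    by_label = {}
    for features, label in records:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in by_label.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is nearest (squared Euclidean distance)."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Step 1: build the model from the labeled data only.
centroids = fit_centroids(labeled)

# Step 2: assign class labels to the unlabeled data.
predictions = [predict(centroids, x) for x in unlabeled]
print(predictions)
```

With these assumed values the three unlabeled applications are assigned `safe`, `risky`, and `risky`. The third application lies between the two groups, and its label rests entirely on four labeled records, which is the situation in which a model built only from labeled data may be inaccurate. The features are left unscaled here, so income dominates the distance; a real model would normally standardize them.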
References (17 articles)
1. Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., & Verkamo, A. I. (1995). Fast Discovery of Association Rules. In Advances in Knowledge Discovery and Data Mining, AAAI Press.
2. Atkinson, A. (1996). The usefulness of optimum experimental designs. Journal of the Royal Statistical Society: Series A (Statistics in Society).
3. Balcan, M., & Blum, A. (2005). A PAC-Style Model for Learning from Labeled and Unlabeled Data. In Proceedings of the 18th Annual Conference on Learning Theory.
4. Blum, A., & Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In Proceedings of the 1998 Conference on Computational Learning Theory.
5. Chapelle, O. (2003). Cluster Kernels for Semi-Supervised Learning. In Advances in Neural Information Processing Systems.