Affiliation:
1. Hong Kong Polytechnic University, Hong Kong
Abstract
A large body of research addresses mining relational databases that store exact values. However, in many real-life applications, such as those common in the service industries, the raw data are uncertain when they are collected or produced. Sources of uncertain data include sensor readings (such as those from RFID tags attached to products in retail stores), classification results (e.g., identities of products or customers) from image processing with statistical classifiers, and the output of predictive programs used for the stock market, targeted marketing, and churn prediction in customer relationship management. Because traditional databases store only exact values, uncertain data are usually transformed into exact data by, for example, taking the mean value (for quantitative attributes) or the value with the highest frequency or probability. The shortcomings are obvious: (1) because the uncertain source values are approximated, the results of the mining tasks are also approximate and may be wrong; (2) useful probabilistic information may be omitted from the results. Research on probabilistic databases began in the 1980s. While a great deal of work has addressed support for uncertainty in databases, a growing body of work addresses mining such uncertain data. A framework is proposed that classifies uncertain data into categories and supports the development of probabilistic data mining techniques that can be applied directly to uncertain data to produce results that preserve accuracy. In this chapter, we introduce the framework together with a scheme for categorizing uncertain data by their properties. We also propose a variety of definitions and approaches for different mining tasks on uncertain data with different properties. Advances in data mining applications of this kind are expected to improve the quality of services provided in various service industries.
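The two shortcomings named in the abstract can be illustrated with a small sketch. It is not the chapter's framework: it assumes a hypothetical set of transactions in which each item carries an existence probability, assumes items occur independently within a transaction, and uses expected support, one common measure in uncertain frequent-itemset mining. The function names, the sample data, and the 0.5 rounding threshold are all illustrative assumptions.

```python
from math import prod

# Hypothetical uncertain transactions: each maps an item to the probability
# that the item is really present (e.g., derived from a noisy RFID reading).
transactions = [
    {"milk": 0.9, "bread": 0.6},
    {"milk": 0.4, "bread": 0.8},
    {"milk": 0.45, "bread": 0.45},
]


def collapsed_support(itemset, db, threshold=0.5):
    """Support after forcing the data to be exact: an item counts as present
    only if its probability reaches the (assumed) threshold."""
    return sum(
        all(t.get(item, 0.0) >= threshold for item in itemset) for t in db
    )


def expected_support(itemset, db):
    """Sum over transactions of the probability that every item in the
    itemset is present, assuming independence between items."""
    return sum(prod(t.get(item, 0.0) for item in itemset) for t in db)


if __name__ == "__main__":
    for itemset in (["milk"], ["milk", "bread"]):
        print(
            itemset,
            collapsed_support(itemset, transactions),
            round(expected_support(itemset, transactions), 4),
        )
```

For the single item `milk`, the collapsed data count one supporting transaction, while the expected support is about 1.75: rounding discards the two transactions in which milk is present with probability 0.4 and 0.45. Whether the independence assumption holds depends on how the probabilities were produced.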