Abstract
Dirty data exist in many systems, and efficient, effective management of such data is in demand. Because data cleaning may lose useful data and introduce new dirty data, this research manages dirty data without cleaning them and retrieves query results according to users' quality requirements. Since the entity is the unit for understanding objects in the world, and many dirty data result from different descriptions of the same real-world entity, this chapter defines an entity data model for managing dirty data and proposes EntityManager, a dirty data management system that uses the entity as its basic unit and keeps conflicts in the data as uncertain attributes. Although the query language is SQL, queries in the system have different semantics on dirty data. To process queries efficiently, this research proposes novel indexes, data operator implementations, and query optimization algorithms for the system.
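To make the idea of keeping conflicts as uncertain attributes concrete, the following is a minimal Python sketch and not the EntityManager implementation. The names `Entity`, `add_observation`, `confidence`, and `select` are hypothetical, and the assumption that a value's confidence is its share of the observed weight is chosen only for illustration; the abstract does not specify how uncertainty is measured.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Entity:
    """One real-world entity; conflicting values are kept, not cleaned away."""
    entity_id: str
    # attribute name -> {candidate value: accumulated observation weight}
    attributes: Dict[str, Dict[str, float]] = field(default_factory=dict)

    def add_observation(self, attr: str, value: str, weight: float = 1.0) -> None:
        """Record one description of this entity, keeping any conflict."""
        values = self.attributes.setdefault(attr, {})
        values[value] = values.get(value, 0.0) + weight

    def confidence(self, attr: str, value: str) -> float:
        """Assumed measure: share of the total weight supporting `value`."""
        values = self.attributes.get(attr, {})
        total = sum(values.values())
        return values.get(value, 0.0) / total if total else 0.0


def select(entities: List[Entity], attr: str, value: str,
           min_confidence: float) -> List[str]:
    """Return ids of entities matching attr = value at the required quality."""
    return [e.entity_id for e in entities
            if e.confidence(attr, value) >= min_confidence]


# Two source records describe e1 as being in "New York", one says "Boston".
e1 = Entity("e1")
e1.add_observation("city", "New York")
e1.add_observation("city", "New York")
e1.add_observation("city", "Boston")

e2 = Entity("e2")
e2.add_observation("city", "Boston")

entities = [e1, e2]
strict = select(entities, "city", "Boston", min_confidence=0.5)
loose = select(entities, "city", "Boston", min_confidence=0.3)
```

In this sketch the same selection predicate returns different answers depending on the quality threshold the user asks for, which is one way to read the statement that SQL queries take on different semantics over dirty data.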