Abstract
This paper presents a comprehensive performance analysis of duplicate data detection techniques for relational databases. The study compares traditional SQL-based techniques with modern Bloom filter techniques for finding and eliminating records that already exist in the database during bulk insertion, such as the loading phase of the Extract, Transform, and Load (ETL) process and data synchronization across multisite databases. The performance analysis was carried out on several data sizes using SQL, a Bloom filter, and a parallel Bloom filter. The results show that the parallel Bloom filter is highly suitable for duplicate detection in relational databases.
Publisher
Engineering, Technology & Applied Science Research
Cited by: 2 articles.