Collection statistics for fast duplicate document detection

Published:2002-04 Issue:2 Volume:20 Page:171-191
ISSN:1046-8188
Container-title:ACM Transactions on Information Systems
language:en
Short-container-title:ACM Trans. Inf. Syst.

Author:

Chowdhury Abdur¹,Frieder Ophir¹,Grossman David¹,McCabe Mary Catherine¹

Affiliation:

1. Illinois Institute of Technology, Chicago, IL

Abstract

We present a new algorithm for duplicate document detection that uses collection statistics. We compare our approach with the state-of-the-art approach using multiple collections. These collections include a 30 MB collection of 18,577 web documents developed by Excite@Home and three NIST collections. The first NIST collection consists of 100 MB of LA Times documents (18,232 documents), which is roughly similar in the number of documents to the Excite@Home collection. The other two collections are both 2 GB: a web collection of 247,491 documents and TREC disks 4 and 5, a collection of 528,023 documents. We show that our approach, called I-Match, scales in terms of the number of documents and works well for documents of all sizes. We compared our solution to the state of the art and found that, in addition to improved accuracy of detection, our approach executed in roughly one-fifth the time.
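The abstract does not spell out how collection statistics enter the detection step. As a rough illustration only, the Python sketch below assumes the commonly described form of I-Match: terms are filtered by inverse document frequency (idf), and the sorted set of surviving terms is hashed with SHA-1, so documents that share a hash are treated as duplicates. The function names, the tokenizer, and the `min_idf` threshold are hypothetical choices made for this example and are not taken from the paper.

```python
import hashlib
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: simple lowercase alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def idf_table(docs):
    # Collection statistic: idf(t) = log(N / df(t)) over the whole collection.
    n = len(docs)
    df = Counter()
    for text in docs.values():
        df.update(set(tokenize(text)))
    return {term: math.log(n / count) for term, count in df.items()}


def fingerprint(text, idf, min_idf):
    # Keep only terms whose idf meets the (hypothetical) threshold, then
    # hash the sorted unique terms so order and repetition do not matter.
    kept = sorted({t for t in tokenize(text) if idf.get(t, 0.0) >= min_idf})
    if not kept:
        return None  # nothing distinctive left to hash
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()


def duplicate_groups(docs, min_idf):
    # Group document ids by fingerprint; groups of size > 1 are duplicates.
    idf = idf_table(docs)
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        fp = fingerprint(text, idf, min_idf)
        if fp is not None:
            buckets[fp].append(doc_id)
    return [sorted(ids) for ids in buckets.values() if len(ids) > 1]
```

Because each document is reduced to a single hash, grouping takes one pass over the collection after the idf table is built, which is one plausible reading of why the abstract reports good scaling with the number of documents.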

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Science Applications; General Business, Management and Accounting; Information Systems

Link

https://dl.acm.org/doi/pdf/10.1145/506309.506311

Reference: 23 articles.

1. Baeza-Yates R. and Ribeiro-Neto B. 1999. Modern Information Retrieval. Addison Wesley.

2. Brin S. Davis J. and Garcia-Molina H. 1995. Copy Detection Mechanisms for Digital Documents. In Proceeding of the Special Interest Group on Management of Data (SIGMOD'95) (San Francisco CA. May). 298--409. 10.1145/223784.223855

Cited by 113 articles.

1. Near-Duplicate Sequence Search at Scale for Large Language Model Memorization Evaluation;Proceedings of the ACM on Management of Data;2023-06-13

2. Analyzing Techniques for Duplicate Question Detection on Q&A Websites for Game Developers;Empirical Software Engineering;2022-12-08

3. Design of Methodology and a Comparative Analysis of Trigram Technique in Similarity of Textual Data;Communications in Computer and Information Science;2021

4. Detecting Document Versions and Their Ordering in a Collection;Web Information Systems Engineering – WISE 2021;2021

5. Improved Streaming Quotient Filter: A Duplicate Detection Approach for Data Streams;The International Arab Journal of Information Technology;2020-09-01