Affiliation:
1. Illinois Institute of Technology, Chicago, IL
Abstract
We present a new algorithm for duplicate document detection that
uses collection statistics. We compare our approach with the
state-of-the-art approach using multiple collections. These
collections include a 30 MB collection of 18,577 web documents
developed by Excite@Home and three NIST collections. The first NIST
collection consists of 18,232 LA-Times documents (100 MB), which is
roughly similar in number of documents to the
Excite@Home collection. The other two collections are both 2
GB: a web collection of 247,491 documents and the TREC disks 4
and 5 collection of 528,023 documents. We show that our approach,
called I-Match, scales in terms of the number of documents and
works well for documents of all sizes. We compared our solution to
the state of the art and found that, in addition to improved
accuracy of detection, our approach executed in roughly one-fifth
the time.
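
The abstract does not spell out how the collection statistics are used, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that inverse document frequency (idf) is the statistic, that terms with low idf are dropped, and that a single SHA-1 digest over the sorted surviving terms serves as the document fingerprint. The function names, the tokenizer, and the `min_idf` threshold are all hypothetical.

```python
import hashlib
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def inverse_document_frequencies(docs):
    """Map each term to log(N / df), where df is the number of documents containing it."""
    df = Counter()
    for text in docs.values():
        df.update(set(tokenize(text)))
    n = len(docs)
    return {term: math.log(n / count) for term, count in df.items()}


def fingerprint(text, idf, min_idf):
    """Hash the sorted set of terms whose idf is at least min_idf (assumed filter)."""
    kept = sorted({t for t in tokenize(text) if idf.get(t, 0.0) >= min_idf})
    if not kept:
        return None  # nothing survived the filter, so no fingerprint
    return hashlib.sha1(" ".join(kept).encode("utf-8")).hexdigest()


def duplicate_groups(docs, min_idf=0.5):
    """Group document ids that share a fingerprint; min_idf is an illustrative value."""
    idf = inverse_document_frequencies(docs)
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        fp = fingerprint(text, idf, min_idf)
        if fp is not None:
            buckets[fp].append(doc_id)
    return [sorted(ids) for ids in buckets.values() if len(ids) > 1]
```

Because each document yields one fingerprint and grouping is a hash-table lookup, a sketch like this makes one pass to gather statistics and one pass to fingerprint, which is consistent with the scaling claim in the abstract but does not reproduce its measurements.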
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Science Applications; General Business, Management and Accounting; Information Systems
Cited by
113 articles.