Estimating the selectivity of tf-idf based cosine similarity predicates

Published: 2007-06 Issue: 2 Volume: 36 Page: 7-12
ISSN: 0163-5808
Container-title: ACM SIGMOD Record
Language: en
Short-container-title: SIGMOD Rec.

Author:

Tata Sandeep¹, Patel Jignesh M.¹

Affiliation:

1. University of Michigan, Ann Arbor, Michigan

Abstract

An increasing number of database applications today require sophisticated approximate string matching capabilities. Examples of such application areas include data integration and data cleaning. Cosine similarity has proven to be a robust metric for scoring the similarity between two strings, and it is increasingly being used in complex queries. An immediate challenge faced by current database optimizers is to find accurate and efficient methods for estimating the selectivity of cosine similarity predicates. To the best of our knowledge, there are no known methods for this problem. In this paper, we present the first approach for estimating the selectivity of tf.idf based cosine similarity predicates. We evaluate our approach on three different real datasets and show that our method often produces estimates that are within 40% of the actual selectivity.
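To make the quantity being estimated concrete, the sketch below computes the exact selectivity of a tf-idf cosine similarity predicate by brute force: the fraction of strings in a collection whose cosine similarity to a query string meets a threshold. This is not the estimation method of the paper, which the abstract does not describe; it is only the ground truth that such an estimator would approximate. The whitespace tokenizer, the idf formula log(1 + N/df), and all function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(s):
    # Assumption: lowercase whitespace tokens; q-grams are another common choice.
    return s.lower().split()


def normalize(weights):
    # Scale a sparse vector to unit length; an empty or all-zero vector stays empty.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return {}
    return {t: w / norm for t, w in weights.items()}


def build_index(strings):
    # Returns unit-length tf-idf vectors for each string, plus the idf table.
    docs = [Counter(tokenize(s)) for s in strings]
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(d.keys())
    # Assumption: idf = log(1 + N / df); other variants exist.
    idf = {t: math.log(1 + n / df[t]) for t in df}
    vectors = [normalize({t: tf * idf[t] for t, tf in d.items()}) for d in docs]
    return vectors, idf


def vectorize_query(query, idf):
    # Assumption: query tokens absent from the collection are ignored.
    tf = Counter(tokenize(query))
    return normalize({t: c * idf[t] for t, c in tf.items() if t in idf})


def cosine(u, v):
    # Dot product of two unit-length sparse vectors.
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def exact_selectivity(query, strings, threshold):
    # Fraction of strings whose cosine similarity to the query is >= threshold.
    if not strings:
        return 0.0
    vectors, idf = build_index(strings)
    q = vectorize_query(query, idf)
    matches = sum(1 for v in vectors if cosine(q, v) >= threshold)
    return matches / len(strings)
```

Scanning every string like this costs time proportional to the size of the collection, which is why a query optimizer would want a cheaper estimate instead of the exact value.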

Publisher

Association for Computing Machinery (ACM)

Subject

Information Systems, Software

Link

https://dl.acm.org/doi/pdf/10.1145/1328854.1328855

References: 10 articles.

1. Digital Bibliography and Library Project (DBLP) http://dblp.uni-trier.de/.

2. Using q-grams in a DBMS for approximate string processing;Gravano L.;IEEE Data Engineering Bulletin,2001

3. Text joins for data cleansing and integration in an RDBMS

4. Text joins in an RDBMS for web data integration

Cited by 111 articles.

1. Evolving energy landscapes: A computational analysis of the determinants of energy poverty;Renewable and Sustainable Energy Reviews;2024-09

2. Exploring the Role of Dietary Fiber in Modulating Treatment Outcomes for Cancer Patients: A Topic Modeling Approach;2024-06-25

3. Closer in time and higher correlation: disclosing the relationship between citation similarity and citation interval;Scientometrics;2024-06-20

4. Industrial Semiconductor GPT: A Question-and-Answer System that Provides Professional Advice and Problem-Solving Methods for Semiconductor and Factory Equipment and Process;2024 IEEE 33rd International Symposium on Industrial Electronics (ISIE);2024-06-18

5. A Generative AI-Based Assistant to Evaluate Short and Long Answer Questions;SN Computer Science;2024-06-10