Affiliation:
1. Albert Ludwigs University, Freiburg, Germany
Abstract
We consider the problem of fuzzy full-text search in large text collections, that is, full-text search that is robust against errors in both the query and the documents. Standard inverted-index techniques work extremely well for ordinary full-text search but fail to achieve interactive query times (below 100 milliseconds) for fuzzy full-text search even on moderately sized text collections (above 10 GB of text). We present new preprocessing techniques that achieve interactive query times on large text collections (100 GB of text, served by a single machine). We consider two similarity measures: one where the query terms match similar terms in the collection (e.g., algorithm matches algoritm or vice versa), and one where the query terms match terms with a similar prefix in the collection (e.g., alori matches algorithm). The latter is important when we want to display results instantly after each keystroke (search as you type). All algorithms have been fully integrated into the CompleteSearch engine.
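The two similarity measures named in the abstract can be illustrated with a small sketch. The abstract does not state which distance function is used, so this assumes plain Levenshtein (edit) distance for term similarity and, for prefix similarity, the smallest edit distance between the query and any prefix of the term. The function names `edit_distance_row`, `word_distance`, and `prefix_distance` are hypothetical and not taken from the paper or from CompleteSearch; the sketch only shows the measures themselves, not the preprocessing techniques that make them fast.

```python
def edit_distance_row(query, term):
    """Return the last row of the Levenshtein DP table.

    Entry j of the result is the edit distance between the whole
    query and the prefix term[:j].
    """
    # Distance from the empty query to term[:j] is j insertions.
    prev = list(range(len(term) + 1))
    for i, qc in enumerate(query, start=1):
        curr = [i]  # distance from query[:i] to the empty prefix
        for j, tc in enumerate(term, start=1):
            cost = 0 if qc == tc else 1
            curr.append(min(prev[j] + 1,          # delete qc
                            curr[j - 1] + 1,      # insert tc
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev


def word_distance(query, term):
    """Assumed term similarity: edit distance between full strings."""
    return edit_distance_row(query, term)[-1]


def prefix_distance(query, term):
    """Assumed prefix similarity: best edit distance to any prefix of term."""
    return min(edit_distance_row(query, term))


if __name__ == "__main__":
    print(word_distance("algoritm", "algorithm"))
    print(prefix_distance("alori", "algorithm"))
    print(word_distance("alori", "algorithm"))
```

Under these assumptions, algoritm is one edit away from algorithm, and alori is one edit away from the prefix algori, although it is four edits away from the full word. That gap is why search as you type needs the prefix-based measure rather than the whole-term one.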
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Science Applications; General Business, Management and Accounting; Information Systems
Cited by
6 articles.
1. Online to Offline Crossover of White Supremacist Propaganda;Companion Proceedings of the ACM Web Conference 2023;2023-04-30
2. Efficient Top-k Keyword Search in Relational Databases Considering Maximum Integrated Candidate Network (MICN);2022 8th Iranian Conference on Signal Processing and Intelligent Systems (ICSPIS);2022-12-28
3. Quality Evaluation for Documental Big Data;Proceedings of the 22nd International Conference on Enterprise Information Systems;2020
4. BEVA;ACM Transactions on Database Systems;2016-04-07
5. Context-Aware Approximate String Matching for Large-Scale Real-Time Entity Resolution;2015 IEEE International Conference on Data Mining Workshop (ICDMW);2015-11