Abstract
Passage retrieval has been studied for decades, and many recent approaches use dense embeddings generated by deep neural models, an approach called "dense passage retrieval". State-of-the-art end-to-end dense passage retrieval systems typically deploy a deep neural model followed by an approximate nearest neighbor (ANN) search module. The model generates embeddings of the corpus and queries, which are then indexed and searched by the high-performance ANN module. As data scale increases, the ANN module inevitably becomes the efficiency bottleneck. An alternative is the learned index, which achieves high search efficiency by learning the data distribution and predicting the location of the target data. However, most existing learned indexes are designed for low-dimensional data and are not suitable for dense passage retrieval with high-dimensional dense embeddings.
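To make the idea of "learning the data distribution and predicting the target data location" concrete, the following is a minimal one-dimensional sketch, not the method of this paper. It assumes a single least-squares linear model over sorted keys and a bounded local search to correct the prediction; the names `LearnedIndex1D`, `fit_linear`, and `lookup` are hypothetical and chosen only for illustration.

```python
import bisect


def fit_linear(keys):
    """Least-squares fit of position ~ slope * key + intercept over sorted keys."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2
    var = sum((k - mean_k) ** 2 for k in keys)
    cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys))
    slope = cov / var if var else 0.0
    return slope, mean_p - slope * mean_k


class LearnedIndex1D:
    """Illustrative learned index: a model predicts where a key sits in a sorted array."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        self.slope, self.intercept = fit_linear(self.keys)
        # Worst-case prediction error over stored keys; it bounds the local search.
        self.max_err = max(
            abs(self._predict(k) - i) for i, k in enumerate(self.keys)
        )

    def _predict(self, key):
        pos = round(self.slope * key + self.intercept)
        return min(max(pos, 0), len(self.keys) - 1)

    def lookup(self, key):
        """Return the position of key in the sorted array, or -1 if it is absent."""
        guess = self._predict(key)
        lo = max(guess - self.max_err, 0)
        hi = min(guess + self.max_err + 1, len(self.keys))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < len(self.keys) and self.keys[i] == key else -1


index = LearnedIndex1D([3, 8, 15, 16, 23, 42, 57, 91])
position = index.lookup(23)
```

The model replaces a full binary search with one arithmetic prediction plus a search over a small window, which is where the speed-up of a learned index comes from when the key distribution is easy to fit.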
In this paper, we propose LIDER, an efficient high-dimensional Learned Index for large-scale DEnse passage Retrieval. LIDER has a clustering-based hierarchical architecture formed by two layers of core models. As the basic unit LIDER uses to index and search data, a core model includes an adapted recursive model index (RMI) and a dimension reduction component, which consists of an extended SortingKeys-LSH (SK-LSH) and a key re-scaling module. The dimension reduction component reduces the high-dimensional dense embeddings to one-dimensional keys and sorts them in a specific order; the sorted keys are then used by the RMI to make fast predictions. Experiments show that LIDER achieves higher search speed with high retrieval quality compared to state-of-the-art ANN indexes on passage retrieval tasks; for example, on large-scale data it achieves 1.2x the search speed and significantly higher retrieval quality than the fastest baseline in our evaluation. Furthermore, LIDER offers a better speed-quality trade-off.
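The core-model pipeline described above (reduce embeddings to one-dimensional keys, sort them, then predict a position with a learned model) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: sign random projections stand in for the extended SK-LSH, division by 2^bits stands in for the key re-scaling module, and a single linear model stands in for the adapted RMI. The names `CoreModelSketch`, `lsh_key`, and the `extra` parameter are hypothetical.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def lsh_key(vec, planes):
    """Pack sign-random-projection bits into an integer, re-scaled to [0, 1)."""
    bits = 0
    for plane in planes:
        bits = (bits << 1) | (1 if dot(vec, plane) >= 0 else 0)
    return bits / (1 << len(planes))


class CoreModelSketch:
    """Illustrative stand-in for a core model: 1-D keys plus a learned position model."""

    def __init__(self, vectors, n_bits=12, seed=0):
        rng = random.Random(seed)
        dim = len(vectors[0])
        self.vectors = vectors
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
        # Dimension reduction: one scalar key per embedding, kept in sorted order.
        keyed = sorted((lsh_key(v, self.planes), i) for i, v in enumerate(vectors))
        self.keys = [key for key, _ in keyed]
        self.ids = [i for _, i in keyed]
        # A single linear model (assumption) maps a key to its sorted position.
        n = len(self.keys)
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        var = sum((key - mean_k) ** 2 for key in self.keys)
        cov = sum((key - mean_k) * (p - mean_p) for p, key in enumerate(self.keys))
        self.slope = cov / var if var else 0.0
        self.intercept = mean_p - self.slope * mean_k
        self.max_err = max(
            abs(self._predict(key) - p) for p, key in enumerate(self.keys)
        )

    def _predict(self, key):
        pos = round(self.slope * key + self.intercept)
        return min(max(pos, 0), len(self.keys) - 1)

    def search(self, query, k=1, extra=8):
        """Predict a position, scan a window around it, re-rank by inner product."""
        pos = self._predict(lsh_key(query, self.planes))
        width = self.max_err + extra
        lo, hi = max(pos - width, 0), min(pos + width + 1, len(self.ids))
        candidates = self.ids[lo:hi]
        candidates.sort(key=lambda i: dot(query, self.vectors[i]), reverse=True)
        return candidates[:k]
```

In this sketch, widening `extra` scans more candidates and tends to improve retrieval quality at the cost of speed, which is one simple way to picture a speed-quality trade-off; the actual mechanism in LIDER may differ.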
Publisher
Association for Computing Machinery (ACM)
Subject
General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development
Cited by
7 articles.