Affiliation:
1. School of Engineering Science, Simon Fraser University, Canada
2. Computer Science Department, University of California, Los Angeles, United States
Abstract
The k-nearest neighbors (KNN) algorithm is essential in many applications, such as similarity search, image classification, and database queries. With the rapid growth in dataset sizes and in the feature dimension of each data point, processing KNN becomes increasingly compute- and memory-intensive. Most prior studies focus on accelerating the computation of KNN using the abundant parallel resources on FPGAs. However, they often overlook memory access optimizations on FPGA platforms and achieve only a marginal speedup over a multithreaded CPU implementation for large datasets.
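For concreteness, the brute-force computation that such accelerators target can be sketched in plain C++ as follows. This is an illustrative software baseline, not code from the paper; the function name knn_bruteforce is ours. Every query must touch every dataset point, which is why large datasets make KNN memory- as well as compute-bound.

```cpp
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Brute-force KNN: scan the whole dataset, compute a squared Euclidean
// distance per point, and keep the K closest indices in a bounded max-heap.
std::vector<std::size_t> knn_bruteforce(
    const std::vector<std::vector<float>>& data,
    const std::vector<float>& query,
    std::size_t k) {
    // Max-heap of (distance, index): the root is the worst candidate kept.
    std::priority_queue<std::pair<float, std::size_t>> topk;
    for (std::size_t i = 0; i < data.size(); ++i) {
        float dist = 0.0f;
        for (std::size_t d = 0; d < query.size(); ++d) {
            const float diff = data[i][d] - query[d];
            dist += diff * diff;
        }
        if (topk.size() < k) {
            topk.push({dist, i});
        } else if (dist < topk.top().first) {
            topk.pop();             // evict the current worst candidate
            topk.push({dist, i});
        }
    }
    std::vector<std::size_t> result;
    while (!topk.empty()) {
        result.push_back(topk.top().second);  // farthest of the K first
        topk.pop();
    }
    return result;
}
```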
In this article, we design and implement CHIP-KNN: an HLS-based, configurable, and high-performance KNN accelerator. CHIP-KNN optimizes off-chip memory access on modern HBM-based FPGAs such as the AMD/Xilinx Alveo U280 board. CHIP-KNN is configurable for all essential parameters of the algorithm, including the size of the search dataset, the feature dimension and data type of each data point, the distance metric, and the number of nearest neighbors (K). In terms of design architecture, we explore and discuss the tradeoffs between two design versions: CHIP-KNNv1 (ping-pong-buffer-based) and CHIP-KNNv2 (streaming-based). Moreover, we investigate the routing congestion issue in our accelerator design, implement hierarchical structures to shorten critical paths, and integrate an open-source floorplanning optimization tool, TAPA/AutoBridge, to eliminate place-and-route issues. To explore the design space and balance computation and memory access performance, we also build an analytical performance model. Given a user configuration of the KNN parameters, our tool automatically generates TAPA HLS C code for the optimal accelerator design, along with the corresponding host code, for the HBM-based FPGA platform.
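As a rough illustration of the streaming organization described above, the following Vitis-HLS-style sketch shows a top-K selection stage that consumes one candidate distance per cycle from an upstream distance-compute unit, in contrast to a ping-pong design that alternates between buffering and processing. It is a minimal sketch under assumed names (Candidate, topk_stream, K = 10) and is not the authors' TAPA source.

```cpp
#include <hls_stream.h>
#include <limits>

// Illustrative constants and types (ours, not from the paper).
constexpr int K = 10;  // number of nearest neighbors

struct Candidate {
    float dist;  // distance between the query and a dataset point
    int   id;    // index of that dataset point
};

// Streaming top-K stage: keeps the K best candidates seen so far in a
// small sorted register array (best[0] = closest).
void topk_stream(hls::stream<Candidate>& dist_in,
                 hls::stream<Candidate>& result_out,
                 int num_points) {
    Candidate best[K];
#pragma HLS array_partition variable=best complete

    // Initialize with sentinel "infinitely far" candidates.
    for (int i = 0; i < K; ++i) {
#pragma HLS unroll
        best[i].dist = std::numeric_limits<float>::max();
        best[i].id   = -1;
    }

    // Main loop: compare-and-shift insertion keeps best[] sorted in
    // ascending distance order, accepting one new candidate per cycle.
    for (int n = 0; n < num_points; ++n) {
#pragma HLS pipeline II=1
        Candidate c = dist_in.read();
        for (int i = K - 1; i > 0; --i) {
#pragma HLS unroll
            if (c.dist < best[i - 1].dist) {
                best[i] = best[i - 1];  // shift a worse candidate down
            } else if (c.dist < best[i].dist) {
                best[i] = c;            // c belongs exactly at slot i
            }
        }
        if (c.dist < best[0].dist) best[0] = c;  // new overall best
    }

    // Drain the K results, closest first.
    for (int i = 0; i < K; ++i) {
        result_out.write(best[i]);
    }
}
```

In a full design, one such pipeline would presumably sit behind a distance-compute unit per HBM channel, with the analytical performance model choosing how many units to replicate so that compute throughput matches the available memory bandwidth.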
Our experimental results on the Alveo U280 show that, compared to a 48-thread CPU implementation, CHIP-KNNv2 achieves a geomean performance speedup of 15×, with a maximum speedup of 45×. Additionally, we show that CHIP-KNNv2 achieves up to 2.1× performance speedup over CHIP-KNNv1 while increasing configurability. Compared with the state-of-the-art Facebook AI Similarity Search (FAISS) [23] GPU implementation running on an Nvidia Tesla V100 GPU, CHIP-KNNv2 achieves an average latency reduction of 30.6× while consuming only 34.3% of the GPU's power.
Funder
NSERC Discovery
Canada Foundation for Innovation John R. Evans Leaders Fund and British Columbia Knowledge Development Fund
Simon Fraser University New Faculty Start-up
Huawei, Xilinx, and Nvidia
Publisher
Association for Computing Machinery (ACM)
References (44 articles).
1. Accelerated Approximate Nearest Neighbors Search Through Hierarchical Product Quantization
2. N. S. Altman. 1992. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician.
3. G. Aparício, I. Blanquer, and V. Hernández. 2007. A parallel implementation of the K nearest neighbors classifier in three levels: Threads, MPI processes and the grid. In Proceedings of the High Performance Computing for Computational Science, Michel Daydé, José M. L. M. Palma, Álvaro L. G. A. Coutinho, Esther Pacitti, and João Correia Lopes (Eds.). Springer, Berlin, 225–235.
4. Sunil Arya and David M. Mount. 1998. ANN: Library for approximate nearest neighbor searching. In Proceedings of the IEEE CGC Workshop on Computational Geometry. 33–40.
5. An optimal algorithm for approximate nearest neighbor searching fixed dimensions
Cited by
3 articles.