Realizing Efficient On-Device Language-based Image Retrieval

Published: 2024-08-16 Issue: 9 Volume: 20 Page: 1-18
ISSN: 1551-6857
Container-title: ACM Transactions on Multimedia Computing, Communications, and Applications
Language: en
Short-container-title: ACM Trans. Multimedia Comput. Commun. Appl.

Author:

Hu Zhiming¹, Kemertas Mete², Xiao Lan³, Phillips Caleb⁴, Mohomed Iqbal¹, Fazly Afsaneh¹

Affiliation:

1. Samsung AI Centre, Toronto, Canada

2. University of Toronto, Toronto, Canada

3. Meta, Toronto, Canada

4. Recursion, Toronto, Canada

Abstract

Advances in deep learning have enabled accurate language-based search and retrieval (e.g., over user photos) in the cloud. However, many users prefer to store their photos at home due to privacy concerns. As such, a need arises for models that can perform cross-modal search on resource-limited devices. State-of-the-art (SOTA) cross-modal retrieval models achieve high accuracy by learning entangled representations that enable fine-grained similarity calculation between a language query and an image, but at the expense of prohibitively high retrieval latency. Alternatively, a newer class of methods exhibits good performance with low latency but requires far more computational resources and an order of magnitude more training data (i.e., large web-scraped datasets consisting of millions of image–caption pairs), making these methods infeasible to use in a commercial context. From a pragmatic perspective, none of the existing methods is suitable for developing commercial applications for low-latency cross-modal retrieval on low-resource devices. We propose CrispSearch, a cascaded approach that greatly reduces retrieval latency with minimal loss in ranking accuracy for on-device language-based image retrieval. The idea behind our approach is to combine a lightweight and runtime-efficient coarse model with a fine re-ranking stage. Given a language query, the coarse model filters out many of the irrelevant image candidates. After this filtering, only a handful of strong candidates are selected and sent to a fine model for re-ranking. Extensive experimental results with two SOTA models for the fine re-ranking stage on standard benchmark datasets show that CrispSearch achieves a speedup of up to 38 times over the SOTA fine methods with negligible performance degradation. Moreover, our method does not require millions of training instances, making it a pragmatic solution for on-device search and retrieval.
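
The cascade described in the abstract can be illustrated with a minimal sketch. This is not the CrispSearch implementation: the dot-product coarse scorer, the `fine_score` callback, the function name `cascaded_search`, and the shortlist size `k` are all assumptions chosen for illustration. The sketch only shows the general pattern of a cheap filter over every image followed by an expensive re-ranking of the few survivors.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Cheap similarity used by the (hypothetical) coarse stage."""
    return sum(x * y for x, y in zip(a, b))


def cascaded_search(
    query_vec: Vector,
    image_vecs: List[Vector],
    fine_score: Callable[[int], float],
    k: int = 2,
) -> List[int]:
    """Return image indices ranked by the fine model, restricted to the coarse top-k.

    query_vec / image_vecs: precomputed embeddings (an assumption of this sketch).
    fine_score: stand-in for an expensive query-image scorer, called only k times.
    """
    # Coarse stage: score every image with the cheap similarity and keep the top k.
    by_coarse = sorted(
        range(len(image_vecs)),
        key=lambda i: dot(query_vec, image_vecs[i]),
        reverse=True,
    )
    shortlist = by_coarse[:k]

    # Fine stage: re-rank only the shortlisted candidates with the expensive scorer.
    return sorted(shortlist, key=fine_score, reverse=True)


# Toy data: four image embeddings and made-up fine scores.
images = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_fine_scores = {0: 0.2, 1: 0.8, 2: 0.9, 3: 0.1}
ranking = cascaded_search([1.0, 0.0], images, toy_fine_scores.__getitem__, k=2)
```

In this toy run the fine scorer is called for two of the four images rather than all of them, which is where a cascade saves latency. Image 2 has the highest made-up fine score but never reaches the fine stage, showing the accuracy risk a cascade accepts when the coarse filter and the fine model disagree.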

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3649896

References (42 articles)

1. Statista. 2022. Smartphone Unit Shipments by Price Category Worldwide from 2012 to 2022. Retrieved March 28, 2022 from https://www.statista.com/statistics/934471/smartphone-shipments-by-price-category-worldwide/

2. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

3. Haoli Bai, Lu Hou, Lifeng Shang, Xin Jiang, Irwin King, and Michael R. Lyu. 2022. Towards efficient post-training quantization of pre-trained language models. In Advances in Neural Information Processing Systems 35, 1405–1418.

4. Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. 2022. A CLIP-hitchhiker’s guide to long video retrieval. arXiv preprint arXiv:2205.08508 (2022).

5. Lingjiao Chen, Matei Zaharia, and James Zou. 2020. FrugalML: How to use ML prediction APIs more accurately and cheaply. arXiv preprint arXiv:2006.07512 (2020).