FEBench: A Benchmark for Real-Time Relational Data Feature Extraction

Published: 2023-08 Issue: 12 Volume: 16 Pages: 3597-3609
ISSN: 2150-8097
Container-title: Proceedings of the VLDB Endowment
Language: en
Short-container-title: Proc. VLDB Endow.

Author:

Zhou Xuanhe¹,Chen Cheng²,Li Kunyi¹,He Bingsheng³,Lu Mian²,Liu Qiaosheng²,Huang Wei²,Li Guoliang⁴,Zheng Zhao²,Chen Yuqiang²

Affiliation:

1. Tsinghua University

2. 4Paradigm Inc.

3. National Univ. of Singapore

4. Tsinghua University, Zhongguancun Laboratory

Abstract

As the use of online AI inference services rapidly expands across applications (e.g., fraud detection in banking, product recommendation in e-commerce), real-time feature extraction (RTFE) systems have been developed to compute the requested features from incoming data tuples with ultra-low latency. As with relational database queries, these RTFE procedures can be expressed in SQL-like languages. However, there is little research on the workload characteristics of RTFE or on specialized benchmarks for it, especially in comparison with existing database workloads and benchmarks (e.g., concurrent transactions in TPC-C). In this paper, we study RTFE workload characteristics using over one hundred real datasets from open repositories (e.g., Kaggle, Tianchi, UCI ML, KiltHub) and from 4Paradigm. The study highlights significant differences between RTFE workloads and existing database benchmarks in terms of application scenarios, operator distributions, and query structures. Based on these findings, we propose a real-time feature extraction benchmark named FEBench, built on the four important criteria for a domain-specific benchmark proposed by Jim Gray. FEBench consists of selected representative datasets, query templates, and an online request simulator. We use FEBench to evaluate feature extraction systems including OpenMLDB and Flink, and find that each system exhibits distinct advantages and limitations in terms of overall latency, tail latency, and concurrency performance.
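To make the abstract's terms concrete, the following is a minimal illustrative sketch of an RTFE-style request: each incoming tuple triggers a sliding-window aggregate over the recent rows that share its key, and per-request latency is summarized by its median and tail (p99). It is not taken from FEBench, OpenMLDB, or Flink. The class `WindowedFeatureStore`, the feature names, the 10-second window, and the nearest-rank percentile are all assumptions chosen for illustration.

```python
import math
import time
from collections import defaultdict, deque


class WindowedFeatureStore:
    """Hypothetical in-memory store computing sliding-window features per key.

    Assumes timestamps arrive in non-decreasing order for each key.
    """

    def __init__(self, window_seconds=10):
        self.window_seconds = window_seconds
        self.history = defaultdict(deque)  # key -> deque of (timestamp, amount)

    def extract(self, key, ts, amount):
        """Insert the incoming tuple and return features over its time window."""
        rows = self.history[key]
        rows.append((ts, amount))
        # Evict rows that have fallen out of the window (ts - window, ts].
        while rows and rows[0][0] <= ts - self.window_seconds:
            rows.popleft()
        amounts = [a for _, a in rows]
        return {
            "txn_count": len(amounts),
            "amount_sum": sum(amounts),
            "amount_max": max(amounts),
        }


def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending, non-empty list."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def simulate(requests, window_seconds=10):
    """Replay (key, ts, amount) requests; return features and latency summary."""
    store = WindowedFeatureStore(window_seconds)
    features, latencies = [], []
    for key, ts, amount in requests:
        start = time.perf_counter()
        features.append(store.extract(key, ts, amount))
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    summary = {"p50": percentile(latencies, 50), "p99": percentile(latencies, 99)}
    return features, summary


if __name__ == "__main__":
    demo = [("card_1", 0, 5.0), ("card_1", 4, 7.0), ("card_1", 12, 2.0)]
    feats, lat = simulate(demo)
    print(feats[-1], lat)
```

In this sketch the third request for `card_1` sees only the rows at timestamps 4 and 12, because the row at timestamp 0 has left the 10-second window. A production RTFE system would express the same aggregate declaratively and serve many such requests concurrently, which is where the tail-latency and concurrency comparisons in the abstract come in.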

Publisher

Association for Computing Machinery (ACM)

Subject

General Earth and Planetary Sciences,Water Science and Technology,Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/3611540.3611550

References: 60 articles.

1. https://archive.ics.uci.edu/ml/index.php. Last accessed on 2023-2.

2. https://github.com/4paradigm/openmldb. Last accessed on 2023-2.

3. https://github.com/akopytov/sysbench. Last accessed on 2023-2.

4. https://github.com/alibaba/feathub. Last accessed on 2023-2.

5. https://github.com/feathr-ai/feathr. Last accessed on 2023-2.

Cited by 1 article.

1. Biathlon: Harnessing Model Resilience for Accelerating ML Inference Pipelines;Proceedings of the VLDB Endowment;2024-06